Third Year Project - Final Report

Catherine Hunt chh398@soton.ac.uk

Self-Archiving Practices: Online Indicators and Self-Report

1st Supervisor: Professor Stevan Harnad

2nd Supervisor: Dr. Andy Gravell

Abstract

Johannes Gutenberg's printing advances in the 15th century loosened the controls on publishing by broadening and democratising information content and access, and ultimately helped to promote the cultural and scientific upheavals that followed. The Internet, and the World Wide Web in particular, represents another step in this ongoing revolution. Online archives on the World Wide Web (WWW) may eventually supersede the original practice of scholars publishing research in paper journals. In some areas of research, the preliminary stages involved in writing a paper are already making use of the WWW to make research findings public.

Before a paper is published in a journal an early version is submitted to a journal editor who sends it to experts for review. These peers then recommend accepting, rejecting or revising the paper for publication. Papers may be returned to the author for revision, and the submission process repeated until the paper is published.

In both the pre-refereeing 'pre-prints' and the final published 'post-prints', the author cites other authors' work that has been used in the research. If a reader wishes to look at a cited piece of work, that paper has to be found elsewhere.

The Los Alamos Eprint Archive (arXiv) was set up as an online open archive on the WWW, to house papers in Physics. The idea evolved from the practice of authors emailing pre-prints of their papers to peers for informal feedback. This idea was expanded in the form of the arXiv. Authors can put draft copies of the paper up for public view and update them accordingly, until the final, peer-reviewed published version appears.

The Open Citation-linking Project (OpCit) is already investigating the practices of arXiv users and authors.

My aim is to extend the ongoing research by investigating the relation between the objective online indicators and the authors' own verbal reports of their practices and rationale in archiving their work.

Acknowledgements

I would like to express my gratitude to my family who have helped me through the good times and the bad, my tutor, Professor Stevan Harnad for keeping everything in perspective, and Tim Brody for walking me through the technical aspects of my studies. Without any of them, this work would not have been possible.

Contents

1. Introduction

2. Background and Literature Search

3. Report on Technical Progress

4. Plan and Assessment of Remaining Work

5. Bibliography

6. Appendices

6.1. Survey of Users of arXiv

6.2. Survey of Non-Users of arXiv

6.3. Survey of Users of Cogprints

6.4. Survey of Non-Users of Cogprints

Introduction

This report has been compiled to detail all work carried out from October 2000 to May 2001 for the final third year project. The original project brief, compiled in October 2000, can be found in appendix 1. This section, however, aims to give more detail than the brief about exactly how the ideas for the project started and evolved.

With the advent of the World Wide Web (WWW), it seemed inevitable that most individuals and businesses would use the Internet as a wider means of disseminating their wares. Indeed, from individuals using this information superhighway to make holiday snaps available to friends on their computers, to multi-national companies making their business even more accessible to other companies, there is no doubt that the WWW has superseded the more traditional methods of distribution.

The Internet has touched most individuals who have access to a computer; expert users and learners alike are making their work available to other computer users. It is possibly the easiest and fastest way of finding information, without even having to leave the house: information is available from the desktop computer.

One particular area of the WWW that is considered relatively new is the field of the digital library. For the first time in history it is possible to build large-scale services where collections of information are stored in digital formats and retrieved over networks. The digital library is bringing together facets of many disciplines and experts from different backgrounds with different approaches.

This is a fundamental period in the history of libraries and publishing, as technology is advancing to the point where in a completely digital library, nothing ever need reach paper. In other words, the digital library brings the library to the user. Since the 1980s, scientists and librarians have predicted the use of networks to disseminate information. Two factors that are having a great influence on the future of the concept are the rate at which these collections will become available on the web and what the economic balance will be, with some collections offering open access and others paid directly by their users. It is not yet certain what this balance will be.

Despite the fact that digital libraries are only really coming to the forefront now, with the explosion of the World Wide Web (WWW), the vision is not a new one. In 1945, Vannevar Bush published an article called As We May Think. In it, he commented that the methods of the time for transmitting and reviewing the results of research were old and by then totally inadequate for their purpose. He went on to discuss technological advances and provided an outline for one possible approach to the storage of information in what he called Memex. The Memex design used photography to store information, and for many years microfilm was the technology perceived as the most suitable for storing information cheaply.

The first serious attempt to store information on computers was in the late 1960s. The early systems were no threat to the printed document, as only unformatted text could be displayed, in a fixed-spaced font, without diagrams, mathematics or the graphic quality that makes reading easy. Combined with the low resolution and poor contrast of the monitors then in use, this made it hard to convince people to read from a screen. Over the past thirty years these technological barriers have been broken down. In the early 1990s, the falling cost of computing and the explosion of online services led to the development of digital libraries.

There is a branch in the area of the digital library within the WWW known as on-line archiving that has served as the scope of focus for this project. Online archives may eventually supersede the original practice of scholars publishing research in paper journals. In some areas of research, the preliminary stages involved in writing a paper are already making use of the WWW to make research findings public.

These archives are still in their infancy, having started to appear around ten years ago (e.g., Ginsparg's preprint and archive database developed at the Los Alamos National Laboratory http://arxiv.org/, and more recently, Harnad's Psycoloquy http://www.cogsci.soton.ac.uk/psycoloquy/). Yet, despite the explosion of the WWW, they are still only being used by a minority of authors (30-40%), and only in a very few fields (mostly Physics, Mathematics and Computer Science). Some authors appear to still prefer the traditional paper journal as the sole method of disseminating their research.

Before a paper is published in a journal, an early version is submitted to a journal editor who sends it to experts for review. These peers then recommend accepting, rejecting or revising the paper for publication. Papers may be returned to the author for revision; the submission process is repeated until the paper is published (or rejected). However, this process is time-consuming, and according to Harnad:

turnaround times were still slow and uncertain: The requested reprints were a long time in coming, if they came at all, in the non-first world; and even in the first, a lot of time and resources were wasted retrieving reprints. and even the relevant hits had to be committed to growing, groaning reprint shelves in labs, which eventually created storage, navigation and retrieval problems of their own.

The on-line archives evolved from the practice of authors emailing pre-prints of their papers to peers for informal feedback. With the archives, authors can deposit their pre-refereed work (pre-prints) and published work (post-prints) into an archive for all to see. Given the cost and sheer number of refereed journals that exist (over 20,000 according to Bowkers), it is simply not possible for libraries and institutions to subscribe to more than a small portion of all the papers published every year. For those trying to locate papers, the process can be long, laborious and often fruitless.

According to Steven Bachrach et al:

Electronic communication has created new ways to distribute such results and is forcing researchers and publishers to reassess the old procedures and consider new possibilities as we learn to use the Internet. Now, not only can authors easily disseminate their results, but networked readers can have cheap, fast access to more scientific literature and have it in a form that facilitates its use in their own research.

With on-line archives, all papers can be located by anyone quickly and easily -- and at no cost. Authors can put draft copies and successive updates up for public view, until the final, peer-reviewed (published) version appears. Users can follow the research through all of its successive stages from pre-prints through to the post-prints.

In both the pre-refereeing pre-prints and the final published post-prints, the author cites other authors' work that has been used in the research. If a reader wishes to look at a cited piece of work, that paper has to be found elsewhere. This is time-consuming, and often ends with the paper being inaccessible, because the user's institution cannot afford to subscribe to it. On-line archives are developing ways of linking papers to all the papers they cite (OpCit Project). The user only needs to click on the citation -- as long as it too is archived online.

The Los Alamos Eprint Archive (arXiv) was set up as an online archive on the WWW, to house research literature in Physics. The first of its kind, arXiv started ten years ago in 1991 and has been expanding ever since. It has over 130,000 papers deposited to date.

Cogprints was set up in 1997 to archive papers in the areas of Psychology, Neuro-Science, Linguistics, Computer Science, Philosophy and Biology. It has approximately 1000 papers deposited to date.

The Open Citation-linking Project (OpCit) is already investigating the practices of arXiv users and authors. It is currently developing tools to make the existing resources more powerful by completely citation inter-linking all of the papers in arXiv, and eventually to extend this to all the rest of the disciplines in other open archives. For the purposes of the third year project this investigation has been extended to explore a previously unexplored avenue. A fundamental part of the usage of an on-line archive is the habits of the users themselves. The aim has been to extend the ongoing research by investigating the relation between the objective online indicators and the authors' own verbal reports of their practices and rationale in archiving their work.

Particular questions that were asked in the early stages of the project were: -

• Why do authors use the archive?

• Do some areas of Physics archive or use the archive more than others?

• At what point does an author decide to archive a draft?

• Do authors cite pre-prints, published post-prints, or both, and under what conditions?

• When an author cites a pre-print, do they update the citation when the cited paper is revised or published?

• If a paper is eventually accepted by a journal, does the author update the text of the paper or just the reference information?

• What are the authors' practices in archiving successive drafts?

• What are the authors' practices in citing successive drafts?

• What is the relationship between the impact factor of an author and download frequency and other online performance indicators and practices?

Speculative reasons were presented as to why physicists use on-line archives more than other fields of research. The reasons are as follows: -

1. Physicists have a much stronger pre-print culture; they had been mailing or emailing copies of their work to one another well before arXiv began, and arXiv was simply a more efficient way of continuing this practice.

2. Physicists work within a TeX culture. They all write their documents in this format, making it easier to circulate information. Physicists may rely more on each other's work, whereas in Computer Science, fewer papers are written and the development in the field does not depend so closely on earlier papers in the subject area.

3. Physicists are more serious about their research and the speed with which they carry it out is important, making archiving the obvious practice to use.

4. The difference between un-refereed pre-prints and refereed post-prints is minimal.

Indeed, the question of why authors do not use archives also needs to be addressed. The fact that copyright of papers belongs to the journal may be one reason why not everyone has embraced the idea of on-line archives. According to Steven Bachrach et al.:

It is expected that some journal publishers will feel threatened by so fundamental a change in ownership practice. The most important concern for publishers and authors alike is that the Internet enables anyone to create new electronic publishing means. Such new distribution outlets may well overtake traditional publishing institutions, particularly when those institutions fail to keep up with the evolving needs of a scientific community.

Perhaps it is the traditional journals and their policies that stop authors experimenting with new technology? Whether or not it has anything to do with the journals' policies, it seems to be the view of some that the journals will likely disappear within 10 to 20 years, that publication delays will disappear, and that the reliability of the literature will increase with opportunities to add comments to papers and attach references to later works that cite them. The use of electronic forms of scholarly information has typically been growing at 50 to 100 percent per year, so it will be interesting to see if these factors are reflected in the questionnaire responses.

This empirical research has been carried out with the use of questionnaires submitted to the users of on-line archives, to return measures of the views of both the authors and the users of the archives.

Original Research Carried out for the OpCit Project

The Open Citation Linking Project (OpCit) is a funded project being carried out within the university, currently developing tools to make the existing resources more powerful by completely citation inter-linking all of the papers in The Los Alamos Eprint Archive (arXiv) and eventually to extend this to all the rest of the disciplines in other open archives. The project has already seen other researchers, Tim Brody and Ian Hickman investigating the relationships between user practices and on-line indicators. However, the research focussed on the citation impact of papers and the usage patterns of arXiv, rather than the patterns of the users themselves.

The first decision that needed to be made was how to go about the empirical research. The questions that had already been asked in the project brief leaned towards the extraction of personal views from both the authors and readers of E-Print archives. The easiest method of gathering these views was to design a questionnaire. This would produce a regular set of results that could be compared with relative ease.

To question the maximum number of authors and readers, following previous discussion of the WWW, the logical solution was to design a web-based questionnaire to be used on the Internet.

The second decision was who to target with the questionnaires. It was paramount that they reached a set of people that included both users and non-users of on-line archives, and that these people were authors and/or readers of academic papers. For instance, it would be of little use to target someone who does not read or write academic papers.

Each decision is fully documented in the following sections.

The Target Set

The original plan specified sampling authors and readers of academic papers in on-line archives. However, it was ascertained that this set of people would not necessarily encompass non-users of on-line archives. The decision was made to separate the sample set into users and non-users, both as archivers and as readers. This way, the question of why people do or do not use these archives would be answered, hopefully giving a reason why archives have not yet superseded paper journals.

Again, the original plan specified that the set of people targeted would be the users of arXiv. With the change in decision of who to target, this would now encompass both users and non-users of arXiv. However, this would not answer the question of why physicists have a much stronger pre-print culture than other areas of research. Thus, to make the results much broader and more impartial the decision was made to also direct the questionnaires at the users and non-users of Cogprints. (Cogprints was set up in 1997 to archive papers in the areas of Psychology, Neuro-Science, Linguistics, Computer Science, Philosophy and Biology. It has approximately 1000 papers deposited to date.)

With the inclusion of a new set of people, this would now also answer the question of why researchers from a Computer Science background are slower than physicists to embrace this new technology, even though they have had the more direct hand in developing it.

With four subsets of people to question (arXiv users, arXiv non-users, Cogprints users and Cogprints non-users), the overall results would be made more impartial by exploring the habits of researchers in different subject areas.

How to Reach the Target Set

The next task was to make sure that people from the target sets would somehow have access to the relevant questionnaire. A number of avenues were explored to find the most appropriate method of getting them to see the questionnaires.

Research carried out previously for the OpCit project involved gaining permission from arXiv to contact, by email, one hundred and thirty-four authors who used the archive to deposit papers. This resulted in thirty-four responses, a response rate of about 25%. This was only a small subset of arXiv authors, and a larger population was required to get a representative sample of archiving and usage practices. As the new survey required a greater number of responses, and the non-users of archives could not be located so easily, a decision was made to place a link button on as many websites as possible that authors and readers of academic papers might visit. (A link is an area on a web page that, if clicked, takes the user to another specified web page. In this case, the user was alerted to the fact that a survey was taking place that might be of interest to them. Clicking on the relevant link took them to a page offering a choice of the four questionnaires.)

The starting point was to request a link on the arXiv main web site in Los Alamos, followed by links on the mirror sites around the world. Then a link was requested on the Cogprints web site. Both were successfully granted. Thereafter, links were requested and placed on the following web sites: -

It should be noted that all of the web sites are areas that both readers and authors of academic papers are highly likely to use.

The questionnaires were deposited in the author's university file space, allowing access by anyone with a link to them.

The Questions

There were four main considerations in planning the questionnaires:

• Questions should generate valid and reliable information on the matter being surveyed.

• Respondents must find the questions comprehensible.

• Questions likely to produce biased answers should be avoided.

• The questionnaire should be designed with the subsequent data analysis in mind.

It was accordingly decided that (i) the questions should be short and unambiguous, (ii) they should fall into a logical sequence and (iii) leading questions should be avoided.

Writing the questions was an important element of the design process. The above points were taken into account, and the questions were written and rewritten on many occasions so as to be as unbiased as possible and to offer options that the user could choose to cover all eventualities. The users of the questionnaires were offered the chance to give typed comments throughout the questionnaire to make their opinions absolutely clear, thus helping to avoid unintentional bias.

As there were four different sets being sampled, four sets of questions needed to be designed. It was decided that two main questionnaires should be designed initially: one for users of the archives, and one for non-users. In order to make the comparison of results easy and impartial, the questionnaires needed to be designed in parallel. The questions were designed so that the target sets were asked the same questions, except in circumstances where a question was completely irrelevant. Where questions did not run in parallel, another relevant question was asked. For example, questions 1 to 5 are the same in all questionnaires; in question 6 a different question is used for users and non-users; and questions 7 to 14 are the same in all of the questionnaires.
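The parallel design described above can be pictured as one master list of questions from which each questionnaire is derived. The following is only a minimal sketch in Python (the questionnaires themselves were written in HTML and php); the data layout, the function names and the placeholder wordings are all assumptions made for illustration, not the structure actually used in the project.

```python
# Hypothetical master list. "both" holds wording shared by users and
# non-users; where the wording differs, "users" and "non_users" hold
# the two variants. All wordings here are placeholders.
MASTER = [
    {"number": "5", "both": "Shared wording for question 5"},
    {"number": "6",
     "users": "User wording for question 6",
     "non_users": "Non-user wording for question 6"},
    {"number": "7", "both": "Shared wording for question 7"},
]


def build_questionnaire(master, audience):
    """Return (number, wording) pairs for one audience, in master order."""
    if audience not in ("users", "non_users"):
        raise ValueError("audience must be 'users' or 'non_users'")
    result = []
    for question in master:
        # Prefer the shared wording; otherwise use the audience variant.
        wording = question.get("both", question.get(audience))
        if wording is None:
            raise ValueError("no wording for question " + question["number"])
        result.append((question["number"], wording))
    return result


def shared_numbers(master):
    """List the question numbers that can be compared directly."""
    return [q["number"] for q in master if "both" in q]


users_form = build_questionnaire(MASTER, "users")
non_users_form = build_questionnaire(MASTER, "non_users")
```

Because both questionnaires are derived from the same list, the question numbers stay aligned, and the questions that can be compared directly across the target sets are simply those returned by `shared_numbers`.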

The obvious alternative was to run all shared questions in parallel and then include the odd questions at the end. However, the decision was made to keep the questions in logical sequence, making the questionnaires as easy to use as possible.

To make the questionnaires complete, text was composed to introduce the user to why the survey was being carried out, and relevant contact information. Each user of the questionnaires had an option to say who they were, and offer further comments in connection with the questionnaire.

The two master sets of questions, for users and non-users, can be found in appendix 2. The four final individual questionnaires can be visited at: -

arXiv Users http://www.ecs.soton.ac.uk/~chh398/Questionnaires/arXivusers.php3

arXiv Non-Users http://www.ecs.soton.ac.uk/~chh398/Questionnaires/arXivnonusers.php3

Cogprints Users http://www.ecs.soton.ac.uk/~chh398/Cogprintsusers.php3

Cogprints Non-Users http://www.ecs.soton.ac.uk/~chh398/Cogprintsnonusers.php3

Alternatively, the final questionnaires can be found on the floppy disk marked.. at the back of this report.

The Questionnaires: Technical Aspects

Once decisions had been made about the questions to be asked and the form in which they would be presented, the next problem was how to implement the questionnaires.

The questionnaires were to be web-based. Two options for setting them up were considered: HyperText Markup Language (HTML) and Microsoft Access (MS Access). MS Access offers functionality including wizards that automatically design web form templates using a database set up by the user. Having little knowledge of MS Access, the decision was made to avoid this option, as time was in short supply.

Having used HTML previously, it seemed the natural solution for setting up the questionnaires. It is an easy to learn mark-up language that offers complete control and freedom when considering the aesthetics of the web pages and can be linked to scripts that will process the results from web forms. For these reasons, HTML was the chosen format for the questionnaires.

Tim Brody had already designed a basic questionnaire for research with the OpCit project and, following consultation with him, that design provided a starting point for learning how to create the four new questionnaires. Basic techniques for creating web-based questionnaires were explored, including how to create form fills, radio buttons and drop-down menus. Two books on HTML were used to learn these techniques.

The four final versions of the questionnaires made use of option boxes that users could scroll through if there were a large number of choices, comment boxes that offered the users the chance to type in personal comments, and drop-down menus that offered a larger choice of options if the question required it.

The four questionnaires were made uniform, for the reasons discussed earlier in the report. This facilitated the data processing and comparisons. At this stage, there were essentially two basic layouts, one for the users of both archives and the other for the non-users. The wording of the titles and explanations was different according to the target set.

Many revisions of the questionnaires were made to ensure that the questions were worded correctly and readily understandable.

Figure 1 shows a small amount of the raw HTML used for the questionnaires.

2.1 Do you self-archive your papers in arXiv?



2.2 If yes, since when have you been self-archiving in arXiv?

Figure 1. Example of HTML used in the questionnaires

Each question was individually tagged giving it a unique reference so that when the results were processed, they could be easily identified. There are a number of options for each question, or an optional text field. The chosen option value or text is matched to the tag.
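To illustrate how a chosen option or typed text is matched to its tag, the sketch below parses one submitted form into a mapping from tag to answer. It is written in Python purely for illustration; the tag names (such as `q002_1`), the answers and the function name are hypothetical, and the project's own processing scripts are not reproduced here.

```python
from urllib.parse import parse_qs


def responses_by_tag(form_body):
    """Map each question tag in a submitted form body to its answer(s)."""
    # keep_blank_values=True keeps text fields that were left empty.
    parsed = parse_qs(form_body, keep_blank_values=True)
    result = {}
    for tag, values in parsed.items():
        # Multi-select questions return several values; others return one.
        result[tag] = values if len(values) > 1 else values[0]
    return result


# Hypothetical submission: tag names and answers are made up.
example = "q002_1=yes&q002_2=1995&q001_1=optionA&q001_1=optionB&q004_3="
answers = responses_by_tag(example)
```

Here a multiple-choice question such as 1.1 yields a list of selected options, a single-choice question yields one value, and an empty comment box yields an empty string, so every tag on the form appears in the processed results.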

One particular problem that was encountered was how to create the drop-down option menus for questions such as question 3.1 that asked how many papers an author had written in their lifetime. The answer could conceivably be anything from zero to one thousand. To offer an option box with one thousand and one choices and corresponding tags would have taken too much time, so a solution had to be found.

A second problem was how to tag the responses numerically. The original plan was to use plain decimal numbers. However, after testing this system, it was discovered that the tags were sorted as text rather than as numbers: all tags starting with 1 were listed first, then those starting with 2, resulting in 1, 10, 11, 12, ..., 19, 2, 20, 21, 22, ... This muddled the order of the questions and made the results confusing. Padding each tag with leading zeros to a fixed width of three digits forced the results into correct numerical order and solved the problem. Thus, the tags look like: - 001, 002, 003, ..., 011, 012, 013, ..., 020, 021
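The ordering problem, and the effect of the fixed-width tags (001, 002, ...), can be reproduced in a few lines. This is a minimal sketch in Python rather than the code used in the project, and the helper name `pad_tag` is an assumption.

```python
def pad_tag(number, width=3):
    """Return the number as a fixed-width, zero-padded tag, e.g. 7 -> '007'."""
    return str(number).zfill(width)


# Plain decimal tags are compared character by character when sorted
# as text, so "10" comes before "2".
plain_tags = [str(n) for n in range(1, 23)]
plain_sorted = sorted(plain_tags)

# Zero-padded tags all have the same length, so text order and
# numerical order agree.
padded_tags = [pad_tag(n) for n in range(1, 23)]
padded_sorted = sorted(padded_tags)
```

With plain tags the sorted list begins 1, 10, 11 and reaches 2 only after 19; with padded tags the sorted list is identical to the list in numerical order. Three digits are enough as long as there are fewer than one thousand tags.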

To solve the first problem, the suggestion was made, following consultation with peers, to use php. This entailed writing a small piece of code within the HTML that would dynamically produce all of the options with tags to match. The php code used is seen in Figure 2.
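The idea of producing the options dynamically, instead of typing one thousand and one of them by hand, can be sketched as follows. The original used php embedded in the HTML; this Python version is only an illustration of the same loop, and the function names, the exact markup and the menu name `q003_1` are assumptions.

```python
def option_tags(low=0, high=1000):
    """Return one HTML option element per integer from low to high inclusive."""
    return ['<option value="{0}">{0}</option>'.format(n)
            for n in range(low, high + 1)]


def select_menu(name, low=0, high=1000):
    """Wrap the generated options in a select element with the given name."""
    lines = ['<select name="{0}">'.format(name)]
    lines.extend(option_tags(low, high))
    lines.append("</select>")
    return "\n".join(lines)


# Hypothetical menu for a question such as 3.1 (total papers written).
menu_html = select_menu("q003_1")
```

A single loop replaces the hand-written list, and the range can be changed for other questions by passing different bounds.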

Question 1

1.1 What are your areas of specialization?
(If using Windows: use control key to select multiple options. If using Linux: click multiple options.)



1.2 Other (please specify):


Question 2

2.1 Do you self-archive your papers in arXiv?



2.2 If yes, since when have you been self-archiving in arXiv?



Question 3

3.1 Please estimate the total number of papers that you have written to date.
[...] $counter\n" ); } ?>

3.3 Please estimate the average number of papers you write per year.
[...] $counter\n" ); } ?>


Question 4

4.1 Why did you start self-archiving? (Please indicate how you found out about the possibility and why you decided to do it.)



4.2 How do you get feedback about your pre-refereeing drafts?



4.3 Comments. (If none of these apply, please elaborate).


Question 5

5.1 Have you archived your papers anywhere other than arXiv?



5.2 If yes, where are they archived?


5.3 If yes, since when have you been self-archiving there?


5.4 Please leave blank.


5.5 If you do not self-archive, we are very interested to hear as full as possible an inventory of your reasons for not self-archiving.


Question 6

How many papers have you archived in arXiv?



Question 7

What percentage of all your current papers do you archive in arXiv?



Question 8

8.1 What percentage of your current pre-refereeing preprints do you archive?



8.2 If you do archive these preprints, please explain why.


8.3 If you do not archive these preprints, please explain why not.


Question 9

9.1 Of the intermediate revised drafts of your current pre- refereeing preprints, what percentage do you archive?



9.2 Please indicate your average number of intermediate revised drafts per paper:



9.3 If you do archive these revised drafts, why?


9.4 If you do not archive these revised drafts, why not?


Question 10

10.1 How important do you think it is to update the reference (journal, volume, date) of your archived preprint to match the published version once the refereed draft is published?



10.2 Please give reasons why you think it is important, if important.

10.3 Please give reasons why you think it is not important, if not important.


Question 11

11.1 What percentage of your current refereed final drafts ("postprints") do you archive?



11.2 If you do archive these drafts, why?


11.3 If you do not archive these drafts, why not?


Question 12

12.1 What percentage of your revised updates or corrections of published final drafts ("post-postprints") do you archive?



12.2 Please estimate your average number of revised updates or corrections per paper:



12.3 If you do archive these updates or corrections, why?


12.4 If you do not archive these updates or corrections, why not?


Question 13

13.1 What percentage of your papers do you archive retrospectively (i.e., papers you wrote before you started self-archiving)?



13.2 Please explain why you archive retrospectively, if you do.


13.3 Please explain why you do not archive retrospectively, if you do not.


Question 14

Please estimate what percentage of the current papers you read are:-

14.1 (U) unrefereed preprints versus (R) refereed postprints.


14.2 (OL) on-line versus (OP) on-paper.


Question 15

Please estimate the percentage of online papers that you:-

15.1 Only skim and browse on-screen.


15.2 Down-load and skim/browse.


15.3 Read fully on-screen.


15.4 Print out and read fully on paper.


15.5 Please comment on your practices.


Question 16

Please estimate what percentage of the current papers you cite are:-

16.1 (U) unrefereed preprints versus (R) refereed postprints.


16.2 (OL) on-line versus (OP) on-paper.


Question 17

In your experience, how much difference in content is there between your unrefereed preprints and your refereed postprints (i.e., how much substantive change results from the refereeing)?

17.1 Preprint-to-postprint difference in the form of your paper:-


17.2 Preprint-to-postprint difference in the content of your paper:-


17.3 Potential importance of the difference between the preprint and the postprint to the reader:-


17.4 Please compare the case of your own papers with the case of papers by other researchers whose work you read and use.


Question 18

As a user, what is your relative reliance on unrefereed preprints (U) versus refereed postprints (R) in reading, citing, and trying to build on the work?

18.1 Reading:-


18.2 Citing:-


18.3 Trying to build on the work:-


18.4 Please explain your thoughts and practices, and state any reasons that come to mind.


Question 19

As a reader, what proportion of your reading of research papers is from:-

19.1 Free on-line archives like arXiv.


19.2 Fee-based journal on-line archives.


19.3 On-paper journals (or on-paper preprints/reprints).


19.4 Please explain your practices (e.g., availability, reliability, accessibility, cost, authentication).


Question 20

20.1 Do you have any concerns about priority problems arising from your self-archiving?



20.2 Please explain your concerns, if any.


Question 21

21.1 Do you have any concerns about plagiarism problems arising from your self-archiving?


21.2 Please explain your concerns, if any.


Question 22

22.1 Do you have any concerns about journal copyright problems arising from your self-archiving?



22.2 Please explain your view, if any.


Question 23

23.1 Do you have any concerns about journal embargo policy arising from your self-archiving? (Some journals, such as Science, consider online self-archiving as "prior publication" and say they will not referee such papers.)



23.2 Please explain your concerns, if any.


Question 24

As an archive user, do you prefer to be automatically alerted of new archived papers in your interest area, or do you prefer to search the archive when you choose to, or both?



Question 25

What effect do you think archiving has on the impact of your research in the following areas:-

25.1 Visibility of your work.


25.2 Citation of your work.


25.3 Replication/application of your work.


25.4 Influencing further work of others.


25.5 Official recognition of your work.


25.6 Immediacy of your work.


25.7 Please elaborate any further comments that come to mind.


Question 26

Please add any further comments that come to mind in connection with any topic in this questionnaire.


Question 27

All these responses will be kept confidential and anonymous. If you are willing to allow me to contact you in case there is any need for clarification of some of your responses, please enter your name and Email address. Your data will still remain completely confidential and anonymous.

Name (optional):

Email (optional):




About this Survey

Why are we conducting this survey?

We are analyzing the practices of authors who do and do not use the Physics arXiv. The findings will be used to suggest potential enhancements of the services as well as to get a deeper understanding of the very rapid developments in the on-line dissemination and use of scientific and scholarly research.

We would be very grateful if you would help us by providing critical information that cannot be derived from the automated archive data (e.g., what proportion of your papers [if any] you have deposited in the archive, what your updating policy is, what your practices are in citing other authors' work -- and what your reasons are for not archiving, etc., if you do not).

We are also anxious to hear any comments, questions, suggestions or criticisms you may have; please either email me, Cathy Hunt, or leave a comment on the form.

The raw data that you send will be stored in a private database on this machine (fluffy.ecs.soton.ac.uk). This database is not accessible to the outside world.

Who am I?

I am Cathy Hunt, a research student at the University of Southampton working under the supervision of Professor Stevan Harnad.

We are surveying users and non-users of arXiv and/or "Cogprints" (Computer Science, Biology, Psychology, Linguistics and Neuroscience).

I didn't ask to be surveyed, why am I receiving this email?

I hope that you do not mind our contacting you like this. If you do not wish to participate, please either let us know or just discard this request.

Help

If you have any problems, please contact me at chh398@ecs.soton.ac.uk and I will try to fix them as soon as possible.

I try to select more than one option in a reply, but I can't!

Please select only the single option that applies the most closely to you.

Your allowed responses seem a bit narrow, how do I elaborate on my answers?

Please use the comment boxes throughout the questionnaire to provide any other comments you have. Alternatively you can email them to me (chh398@ecs.soton.ac.uk).

I made a mistake, can I submit again?

Yes. Although I will store all submissions, before using the data I will check for duplicates and use only the most recent submission by the same author.

Thank you for taking the time to participate in this research. The results will be reported on our website as soon as they are analyzed.

Cathy Hunt
chh398@ecs.soton.ac.uk

Appendix 4 - Perl Script to create and manage the databases

#!/usr/bin/perl

# Include relevant library files
# Use the CGI module, which gives access to the form data submitted by the user
use CGI qw/:standard/;
# Use the database module - this accesses the database to add more responses
use DBI;
# Use the strftime function from the POSIX standard - converts date and time
# to a string and returns the string
use POSIX qw(strftime);
# Use Socket for inet_aton and AF_INET, which are needed below to look up
# the host name of the submitting machine
use Socket;
# Use the strict module to force me to declare variables
use strict;

# Compile all of the question responses into a single hash table
# q1 => q1value
# q2 => q2value
# Join multiple values for the same question and separate them by commas
my @keys = param;
my %response;
foreach my $key (@keys) {
    $response{$key} = join(",", param($key));
}

# Add the date of this submission and the machine it was sent from
# "localtime" converts the time and date to local time
# Translate and return the host name of the questionnaire user's computer
$response{date} = strftime("%Y/%m/%d-%H:%M:%S", localtime);
$response{address} = (gethostbyaddr(inet_aton($ENV{REMOTE_ADDR}), AF_INET))[0];

# Print out the start of the HTML response
print header("text/html"), start_html("Survey Results"), h1("Survey Results");

# Use thankyou text
printfile("thankyou.txt");

# Print out the questionnaire responses back to the user
# Start the HTML table and sort the keys so the rows print in a fixed order
print "<table>";
print "<tr><th>Question Name</th><th>Response</th></tr>";
foreach my $key (sort keys %response) {
    print "<tr><td>", $key, "</td><td>", $response{$key}, "</td></tr>";
}
print "</table>";

# Log the responses to a text file (in case the database breaks)
if( open(FILE, ">>chh398$response{table}.log") ) {
    my @oneliner;
    foreach my $key (@keys) {
        my $onevalue = $key.'='.$response{$key};
        # Escape backslashes, speech marks and newlines
        $onevalue =~ s/\\/\\\\/g;
        $onevalue =~ s/\"/\\\"/g;
        $onevalue =~ s/\n/\\n/g;
        push(@oneliner, $onevalue);
    }
    # Join each element in the array, and separate with a comma
    print FILE '"';
    print FILE join('","', @oneliner);
    print FILE "\"\n";
    close(FILE);
}

# Specify location of database. Use DBSTR to identify the driver to connect to
my $DBSTR = "dbi:mysql:database=cite_base;host=mocha.ecs.soton.ac.uk;port=3306;";

# Connect to the database on mocha
# dbh = database handle. Connect = connects to database handle object
# Supply username and password
my $dbh = DBI->connect($DBSTR, "tdb198", "hallibut", { RaiseError=>0, AutoCommit=>1 });

# Find out what tables there are
my $SQL = "SHOW TABLES;";

my %tables;
my $sth = $dbh->prepare($SQL);
$sth->execute;
while( defined(my $row = $sth->fetchrow_arrayref()) ) {
    $tables{$row->[0]} = 1;
}
$sth->finish;

# Check to see if the table exists. If it does not, create it from the
# given responses, i.e. fill out the form once with all the fields filled in
if( exists($response{table}) && !exists($tables{"chh398".$response{table}}) ) {
    $SQL = "CREATE TABLE chh398$response{table} (";
    foreach my $key (sort keys %response) {
        $SQL .= "_$key tinytext,";
    }
    chop($SQL);
    $SQL .= ");";
    # DBI reports its errors in $DBI::errstr rather than in $!
    $sth = $dbh->prepare($SQL);
    if( !$sth ) {
        print "Error preparing SQL: $DBI::errstr\n";
    } elsif( !$sth->execute ) {
        print "Error creating table: $DBI::errstr\n";
    }
    $sth->finish if $sth;
# If the table has already been created, add the new responses
} else {
    # Take the column order once so that names and values stay aligned
    my @columns = keys(%response);
    my $names = "_".join(",_", @columns);
    # One placeholder per column
    my $values = join(",", ("?") x @columns);
    # Add the record
    $SQL = "INSERT INTO chh398$response{table} ($names) VALUES($values);";
    $sth = $dbh->prepare($SQL);
    my @values;
    foreach my $key (@columns) {
        push(@values, $response{$key});
    }
    $sth->execute(@values) || print "Error adding record: $DBI::errstr";
    $sth->finish;
}

# Disconnect the database
$dbh->disconnect;

print end_html;

# Print the contents of a file and close it - subfunction
sub printfile {
    my $fn = shift || return;
    if( open(FILE, "<$fn") ) {
        while( defined($_ = <FILE>) ) {
            print;
        }
        close(FILE);
    }
}
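The survey's help text says that duplicate submissions are resolved by keeping only the most recent one from the same author. The following is a minimal sketch of how that check might look, not the procedure actually used for the analysis. It assumes each stored response has been loaded into a hash with a `date` field in the format written by the script above, and an `author` field; the `author` key, the function name `latest_per_author` and the sample records are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep only the most recent submission for each author.
# Each record is a hash reference with at least:
#   author => hypothetical identifier (e.g. the optional Email address)
#   date   => timestamp in the "%Y/%m/%d-%H:%M:%S" format
sub latest_per_author {
    my @records = @_;
    my %latest;
    foreach my $rec (@records) {
        my $who = $rec->{author};
        # The year-first, zero-padded format sorts chronologically as a string
        if ( !exists $latest{$who} || $rec->{date} gt $latest{$who}{date} ) {
            $latest{$who} = $rec;
        }
    }
    return map { $latest{$_} } sort keys %latest;
}

# Made-up sample data, for illustration only
my @submissions = (
    { author => 'a@example.org', date => '2001/03/01-10:15:00', q1 => 'first try' },
    { author => 'b@example.org', date => '2001/03/02-09:00:00', q1 => 'only try'  },
    { author => 'a@example.org', date => '2001/03/05-16:30:00', q1 => 'corrected' },
);

foreach my $rec ( latest_per_author(@submissions) ) {
    print "$rec->{author}: $rec->{q1} ($rec->{date})\n";
}
```

Comparing the timestamps as strings only works because the year comes first and every part is zero-padded. Because name and Email are optional in the questionnaire, a real check would also need a rule for identifying anonymous resubmissions.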

Appendix 5 - Perl Script to manage queries using SQL and output the results in a number of formats.

#!/usr/bin/perl -w
# The -w switch makes Perl print any warnings

# Use the CGI module, which gives access to the form data submitted by the user
use CGI qw/:standard/;
# Import a library to access the database
use DB::DB;
# Import an XML writing library - to pass back formatted XML
use XML::Writer;

# Use the strict module - forces me to declare variables
use strict;

# Create a new session with the database, supplying the username and password
my $DB = new DB::DB(Username => 'web', Password => 'web');

# Get a handle to the database
my $dbh = $DB->dbh;

# Declare new variables - SQL query handle and the start time
my ($sth,$starttime);

# Print out results if an SQL search is started on the database
# Read the SQL command typed in (undefined if the form was not submitted)
my $SQL = param('sql');
if( defined($SQL) && $SQL ne '' ) {
    # Set the start time
    $starttime = time();
    # Prepare the query - or throw an error message
    $sth = $dbh->prepare($SQL);
    if( !$sth ) {
        error("There was an error in preparing your SQL query");
    }
    # If there are no problems, execute the query - or throw an error message
    if( !$sth->execute() ) {
        error("There was an error in executing your SQL query");
    }

    # Give the options to choose either XML, HTML or CSV for the way in
    # which results are presented (HTML if nothing was selected)
    my $format = param('output') || 'HTML';
    if( $format eq 'XML' ) {
        print_xml();
    } elsif( $format eq 'CSV' ) {
        print_csv();
    } else {
        print_html();
    }

    $sth->finish;

} else {
    # Otherwise we just print the fill-in form
    print header, start_html;
    print_form();
    print end_html;
}

$DB->disconnect;

# Print an HTML fill-in form
# (the callers have already printed start_html, so it is not repeated here)
sub print_form {
    # Print out a box to type the SQL query
    print start_form,
        h1("SQL Query"),
        "<table>",
        "<tr><td>Enter SQL:</td><td>",
        textarea(-name => 'sql', -rows => 4, -cols => 40),
        "</td></tr>",
        # Give radio buttons to select output format
        "<tr><td>Select output format:</td><td>",
        radio_group(-name=>'output','-values'=>['HTML','XML','CSV'],-default=>'HTML',-linebreak=>'true'),
        "</td></tr>",
        # Option to save to file
        "<tr><td>Save the record to file:</td><td>",
        checkbox(-name=>'save',-value=>'yes'),
        "</td></tr>",
        "</table>",
        # Button to start the search
        submit(-value=>'Search'),
        end_form;
}

# Print the results in HTML format
sub print_html {
    # To save, set the type to "application/bin", i.e. a BINary file
    # (can only save these)
    if( param('save') ) {
        print header(-type=>'application/bin; charset=UTF-8',-charset=>'utf-8');
    } else {
        print header(-type => 'text/html; charset=utf-8', -charset => 'utf-8');
    }
    # Start the HTML (...)
    print start_html;
    # Because the output is in HTML, the user can also have the form again
    print_form();

    print h2("Results");

    # Loop through the database until all records are retrieved and
    # print each one in an HTML table
    while( defined(my $row = $sth->fetchrow_hashref()) ) {
        print "<table>";
        foreach my $key (sort keys %$row) {
            # Test with defined() so that 0 and empty strings are shown as they are
            my $value = defined($row->{$key}) ? $row->{$key} : 'Undef';
            print "<tr><td>", $key, "</td><td>", $value, "</td></tr>";
        }
        print "</table>";
    }

    # The start time set earlier is used to calculate how long the
    # query took to run
    print p("Took ".(time()-$starttime)." seconds to run.");
    print end_html;
}

# Print the results in XML format
sub print_xml {
    # To save, set the type to "application/bin", i.e. a BINary file
    # (can only save these)
    if( param('save') ) {
        print header(-type=>'application/bin; charset=UTF-8',-charset=>'utf-8');
    } else {
        print header(-type=>'text/xml; charset=UTF-8',-charset=>'utf-8');
    }
    # Create a new XML Writer object, set OUTPUT to stdout so it doesn't
    # print before the header (above), we format it so it's readable -
    # the default would not put any whitespace in
    my $w = new XML::Writer(OUTPUT=>*STDOUT,DATA_MODE=>1,DATA_INDENT=>2);

    # Write the XML declaration
    $w->xmlDecl('UTF-8');
    # Open the root element
    $w->startTag('results');
    # Retrieve all the records from the database and print them in rows
    while( my $row = $sth->fetchrow_arrayref ) {
        $w->startTag('record');
        for( my $i = 0; $i < @{$sth->{NAME}}; $i++ ) {
            $w->dataElement($sth->{NAME}->[$i], $row->[$i]);
        }
        $w->endTag('record');
    }
    # Close the root element
    $w->endTag('results');
    $w->end;
}

# Print the results in CSV (Comma-Separated Values) format so the
# results can be loaded into Excel
sub print_csv {
    # To save, set the type to "application/bin", i.e. a BINary file
    # (can only save these)
    if( param('save') ) {
        print header(-type=>'application/bin; charset=UTF-8',-charset=>'utf-8');
    } else {
        print header(-type=>'text/plain; charset=UTF-8',-charset=>'utf-8');
    }
    # Put column names on the first row
    print '"',join('","',@{$sth->{NAME}}),'"',"\n";
    # Output the data, escaping newlines to stop them breaking the rows
    # Escape speech marks to double speech marks (MS style)
    while( my $row = $sth->fetchrow_arrayref ) {
        # Work on a copy of the row, writing NULL values as empty strings
        my @fields = map { defined($_) ? $_ : '' } @$row;
        foreach my $field (@fields) {
            $field =~ s/\"/\"\"/g;
            $field =~ s/[\r\n]/\\n/g;
        }
        # Join the separate strings together separated by a comma
        print '"',join('","',@fields),'"',"\n";
    }
}

# Print error message
sub error {
    my $msg = shift;
    # Database errors are reported in $DBI::errstr rather than in $!
    print header, start_html("Error"), h1("Error"), p($msg), p($DBI::errstr || ''), end_html;
    exit;
}
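The CSV output above wraps every field in speech marks, doubles any speech marks inside a field, and writes newlines as the two characters `\n`. As a rough sketch of how such a line could be read back for checking, the following reverses those steps. The function name `parse_csv_line` and the sample line are hypothetical. Because print_csv does not escape backslashes, a `\n` that was already present as text in the data cannot be told apart from an escaped newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one line in the print_csv format back into its fields.
sub parse_csv_line {
    my ($line) = @_;
    chomp $line;
    my @fields;
    # Each field is a quoted run in which a literal speech mark appears as ""
    while ( $line =~ /\G"((?:[^"]|"")*)"(?:,|$)/g ) {
        my $field = $1;
        $field =~ s/""/"/g;      # undo the doubled speech marks
        $field =~ s/\\n/\n/g;    # turn the escaped newlines back into real ones
        push @fields, $field;
    }
    return @fields;
}

# Made-up sample line, for illustration only
my @fields = parse_csv_line(qq{"q1","said ""yes""","line one\\nline two"\n});
print scalar(@fields), " fields\n";
print "[$_]\n" foreach @fields;
```

A malformed line simply stops the loop early, so a fuller version would compare the number of fields with the number of column names on the first row.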

Appendix 6 – Master Spreadsheet

Appendix 7 – Questions complete with Acronyms

Users Surveys

Spec 1. What are your areas of specialization?

ArInArXiv 2.1 Do you self-archive your papers in ......?

ArSince 2.2 If yes, since when have you been self-archiving in .......?

PapsWr 3.1 Estimate the total number of papers written to date.

PapsAr 3.2 How many of these papers have you self-archived on-line to date?

PapsP/Y 3.3 Estimate the average number of papers you write per year.

WhyAr 4.1 Why did you start self-archiving?

Feedback 4.2 How do you get feedback about your pre-refereeing drafts?

ArComms 4.3 Comments

ArElse/W 5.1 Have you archived your papers anywhere other than .......?

Where 5.2 If yes, where are they archived?

When 5.3 If yes, since when have you been self-archiving there?

NAr-Why 5.5 If you do not self-archive, why not?

PapsAr 6.0 How many papers have you archived in ........?

%CurrPAr 7.0 What percentage of all your current papers do you archive in ........?

%CurrPreAr 8.1 What percentage of your current pre-refereeing preprints do you archive?

WhyArPre 8.2 If you do archive these preprints, why?

WhyNArPre 8.3 If you do not archive these preprints, why not?

%ArInterPre 9.1 Of the intermediate revised drafts of your current pre-refereeing preprints, what percentage do you archive?

AvNoInterD 9.2 Indicate your average number of intermediate revised drafts per paper.

WhyArInterD 9.3 If you do archive these revised drafts, why?

WhyNArInterD 9.4 If you do not archive these revised drafts, why not?

ImpOfRefUp 10.1 How important is it to update the reference of your archived preprint to match the published version once the refereed draft is published?

RefUpComm+ 10.2 Give reasons why you think it is important.

RefUpComm- 10.3 Give reasons why you think it is not important.

%ArRefdPaps 11.1 What percentage of your current refereed final drafts do you archive?

ArRefdComms+ 11.2 If you do archive these drafts, why?

ArRefdComms- 11.3 If you do not archive these drafts, why not?

%ArRefdCorr 12.1 What percentage of your revised updates or corrections of published final drafts do you archive?

AvCorrsP/Pap 12.2 Estimate your average number of revised updates or corrections per paper.

ArCorrComms+ 12.3 If you do archive these updates or corrections, why?

ArCorrComms- 12.4 If you do not archive these updates or corrections, why not?

%PapsArRetro 13.1 What percentage of your papers do you archive retrospectively (i.e., papers you wrote before you started self-archiving)?

ArRetroComms+ 13.2 Explain why you archive retrospectively, if you do.

ArRetroComms- 13.3 Explain why you do not archive retrospectively, if you do not.

14. Estimate what percentage of the current papers you read are:-

%PapsRUvR 14.1 unrefereed preprints V refereed postprints.

%PapsROlvOp 14.2 on-line V on-paper.

15. Estimate the percentage of online papers that you:-

%OlPapsS+B 15.1 Only skim and browse on-screen.

%OlPapsD,S+B 15.2 Down-load and skim/browse.

%OlPapsRO/S 15.3 Read fully on-screen.

%OlPapsPr 15.4 Print out and read fully on paper.

16. Estimate what percentage of the current papers you cite are:-

%PapsCUvR 16.1 unrefereed preprints V refereed postprints.

%PapsCOlvOp 16.2 on-line V on-paper.

17. In your experience, how much difference in content is there between your unrefereed preprints and your refereed postprints (i.e., how much substantive change results from the refereeing)?

%DiffPr/PtF 17.1 Preprint-to-postprint difference in the form of your paper.

%DiffPr/PtC 17.2 Preprint-to-postprint difference in the content of your paper.

%DiffPr/PtR 17.3 Potential importance of the difference between the preprint and the postprint to the reader.

DiffComms 17.4 Compare the case of your own papers with the case of papers by other researchers whose work you read and use.

18. As a user, what is your relative reliance on unrefereed preprints V refereed postprints in reading, citing, and trying to build on the work:-

Pre/PostR 18.1 Reading.

Pre/PostC 18.2 Citing.

Pre/PostB 18.3 Trying to build on the work.

Pre/PostComms 18.4 Explain your thoughts and practices, and state any reasons that come to mind.

19. As a reader, what proportion of your reading of research papers is from:-

%RFrOlAr 19.1 Free on-line archives like ........

%RFeeOlAr 19.2 Fee-based journal on-line archives.

%RJour 19.3 On-paper journals (or on-paper preprints/reprints)?

%RComms 19.4 Explain your practices (e.g., availability, reliability, accessibility, cost, authentication).

PrioConc 20.1 Do you have any concerns about priority problems?

PrioComms 20.2 Explain your concerns, if any.

PlaConc 21.1 Do you have any concerns about plagiarism problems?

PlaComms 21.2 Explain your concerns, if any.

CpyConc 22.1 Do you have any concerns about journal copyright problems?

CpyComms 22.2 Explain your view, if any.

EmbConc 23.1 Do you have any concerns about journal embargo policy arising from your self-archiving?

EmbComms 23.2 Explain your concerns, if any.

ArUAAvS 24. As an archive user, would you prefer to be automatically alerted of new archived papers in your interest area, or would you prefer to search the archive when you choose to, or both?

25. What effect do you think archiving has (or would have) on the impact of your research in the following areas:-

EArVis 25.1 Visibility of your work.

EArCit 25.2 Citation of your work.

EArRep 25.3 Replication/application of your work.

EArInfl 25.4 Influencing further work of others.

EArRecg 25.5 Official recognition of your work.

EArImm 25.6 Immediacy of your work.

EArComms 25.7 Elaborate any further comments that come to mind.

Non-Users Surveys

Spec 1. What are your areas of specialization?

ArIn….. 2.1 Do you self-archive your papers in ......?

ArSince 2.2 If yes, since when have you been self-archiving in .......?

PapsWr 3.1 Estimate the total number of papers written to date.

PapsAr 3.2 How many of these papers have you self-archived on-line to date?

PapsP/Y 3.3 Estimate the average number of papers you write per year.

HowFeedB 4.1 How do you get feedback about your pre-refereeing drafts?

FeedComms 4.2 Comments

ArElseW 5.1 Have you archived your papers anywhere other than .......?

Where 5.2 If yes, where are they archived?

When 5.3 If yes, since when have you been self-archiving there?

WhyAr 5.4 If yes, why did you start self-archiving?

WhyNAr 5.5 If you do not self-archive, why not?

WhyNUse… 6.0 If you do not use ....... to deposit your papers, but use other archives to do so, please indicate why you do not use .......

%CurrPAr 7.0 What percentage of all your current papers do you archive?

%CurrPreAr 8.1 What percentage of your current pre-refereeing preprints do you archive?

WhyArPre 8.2 If you do archive these preprints, why?

WhyNArPre 8.3 If you do not archive these preprints, why not?

%ArInterPre 9.1 Of the intermediate revised drafts of your current pre-refereeing preprints, what percentage do you archive?

AvNoInterD 9.2 Indicate your average number of intermediate revised drafts per paper.

WhyArInterD 9.3 If you do archive these revised drafts, why?

WhyNArInterD 9.4 If you do not archive these revised drafts, why not?

ImpOfRefUp 10.1 How important is it to update the reference of your archived preprint to match the published version once the refereed draft is published?

RefUpComm+ 10.2 Give reasons why you think it is important.

RefUpComm- 10.3 Give reasons why you think it is not important.

%ArRefdPaps 11.1 What percentage of your current refereed final drafts do you archive?

ArRefdComms+ 11.2 If you do archive these drafts, why?

ArRefdComms- 11.3 If you do not archive these drafts, why not?

%ArRefdCorr 12.1 What percentage of your revised updates or corrections of published final drafts do you archive?

AvCorrsP/Pap 12.2 Estimate your average number of revised updates or corrections per paper.

ArCorrComms+ 12.3 If you do archive these updates or corrections, why?

ArCorrComms- 12.4 If you do not archive these updates or corrections, why not?

%PapsArRetro 13.1 What percentage of your papers do you archive retrospectively (i.e., papers you wrote before you started self-archiving)?

ArRetroComms+ 13.2 Explain why you archive retrospectively, if you do.

ArRetroComms- 13.3 Explain why you do not archive retrospectively, if you do not.

14. Estimate what percentage of the current papers you read are:-

%PapsRUvR 14.1 unrefereed preprints V refereed postprints.

%PapsROlvOp 14.2 on-line V on-paper.

15. Estimate the percentage of online papers that you:-

%OlPapsS+B 15.1 Only skim and browse on-screen.

%OlPapsD,S+B 15.2 Down-load and skim/browse.

%OlPapsRO/S 15.3 Read fully on-screen.

%OlPapsPr 15.4 Print out and read fully on paper.

16. Estimate what percentage of the current papers you cite are:-

%PapsCUvR 16.1 unrefereed preprints V refereed postprints.

%PapsCOlvOp 16.2 on-line V on-paper.

17. In your experience, how much difference in content is there between your unrefereed preprints and your refereed postprints (i.e., how much substantive change results from the refereeing)?

%DiffPr/PtF 17.1 Preprint-to-postprint difference in the form of your paper.

%DiffPr/PtC 17.2 Preprint-to-postprint difference in the content of your paper.

%DiffPr/PtR 17.3 Potential importance of the difference between the preprint and the postprint to the reader.

DiffComms 17.4 Compare the case of your own papers with the case of papers by other researchers whose work you read and use.

18. As a user, what is your relative reliance on unrefereed preprints V refereed postprints in reading, citing, and trying to build on the work:-

Pre/PostR 18.1 Reading.

Pre/PostC 18.2 Citing.

Pre/PostB 18.3 Trying to build on the work.

Pre/PostComms 18.4 Explain your thoughts and practices, and state any reasons that come to mind.

19. As a reader, what proportion of your reading of research papers is from:-

%RFrOlAr 19.1 Free on-line archives like ........

%RFeeOlAr 19.2 Fee-based journal on-line archives.

%RJour 19.3 On-paper journals (or on-paper preprints/reprints)?

%RComms 19.4 Explain your practices (e.g., availability, reliability, accessibility, cost, authentication).

PrioConc 20.1 Do you have any concerns about priority problems?

PrioComms 20.2 Explain your concerns, if any.

PlaConc 21.1 Do you have any concerns about plagiarism problems?

PlaComms 21.2 Explain your concerns, if any.

CpyConc 22.1 Do you have any concerns about journal copyright problems?

CpyComms 22.2 Explain your view, if any.

EmbConc 23.1 Do you have any concerns about journal embargo policy arising from your self-archiving?

EmbComms 23.2 Explain your concerns, if any.

ArUAAvS 24. As an archive user, would you prefer to be automatically alerted of new archived papers in your interest area, or would you prefer to search the archive when you choose to, or both?

25. What effect do you think archiving has (or would have) on the impact of your research in the following areas:-

EArVis 25.1 Visibility of your work.

EArCit 25.2 Citation of your work.

EArRep 25.3 Replication/application of your work.

EArInfl 25.4 Influencing further work of others.

EArRecg 25.5 Official recognition of your work.

EArImm 25.6 Immediacy of your work.

EArComms 25.7 Elaborate any further comments that come to mind.

MakeYouAr 26. If you do not archive your work at present, what would incline you to do so?


