Self-Archiving Practices: Online Indicators and
1st Supervisor: Professor Stevan Harnad
2nd Supervisor: Dr. Andy Gravell
Johannes Gutenberg's printing advances in the 15th century loosened the
controls on publishing by broadening and democratising information
content and access, and ultimately helped to promote the cultural and
scientific upheavals that followed. The Internet, and the World Wide
Web in particular, represents another step in this ongoing revolution.
Online archives on the World Wide Web (WWW) may eventually supersede
the original practice of scholars publishing research in paper
journals. In some areas of research, the preliminary stages involved in
writing a paper are already making use of the WWW to make research
findings public.
Before a paper is published in a journal, an early version is submitted
to a journal editor who sends it to experts for review. These peers
then recommend accepting, rejecting or revising the paper for
publication. Papers may be returned to the author for revision, and the
submission process repeated until the paper is published.
In both the pre-refereeing 'pre-prints' and the final published
'post-prints', the author cites other authors' work that has been used
in the research. If a reader wishes to look at a cited piece of work,
that paper has to be found elsewhere.
The Los Alamos Eprint Archive (arXiv) was set up as an online open
archive on the WWW, to house papers in Physics. The idea evolved from
the practice of authors emailing pre-prints of their papers to peers
for informal feedback. This idea was expanded in the form of the
arXiv. Authors can put draft copies of the paper up for public view
and update them accordingly, until the final, peer-reviewed (published)
version appears.
The Open Citation-linking Project (OpCit) is already investigating the
practices of arXiv users and authors.
My aim is to extend the ongoing research by investigating the relation
between the objective online indicators and the authors' own verbal
reports of their practices and rationale in archiving their work.
I would like to express my gratitude to my family who have helped me
through the good times and the bad, my tutor, Professor Stevan Harnad
for keeping everything in perspective, and Tim Brody for walking me
through the technical aspects of my studies. Without them, this work
would not have been possible.
2. Background and Literature Search
3. Report on Technical Progress
4. Plan and Assessment of Remaining Work
6.1. Survey of Users of arXiv
6.2. Survey of Non-Users of arXiv
6.3. Survey of Users of Cogprints
6.4. Survey of Non-Users of Cogprints
This report has been compiled to detail all work carried out from
October 2000 to May 2001 for the final third year project. The original
project brief, compiled in October 2000, can be found in appendix 1.
This section, however, aims to give more detail than the brief about
how exactly ideas for the project started and evolved.
With the advent of the World Wide Web (WWW), it seemed inevitable that
most individuals and businesses would use the Internet as a wider form
of dissemination of their wares. Indeed, from individuals wishing to
use this information superhighway as a means of making holiday snaps
available to friends, to multi-national companies making their business
even more accessible to other companies, there is no doubt that the WWW
has superseded the more traditional methods of dissemination.
The Internet has touched most individuals with access to a computer;
expert users and those still learning alike are now proficient in, or
dabbling in, making work available to other computer users. It is
possibly the easiest and fastest way of finding information without
even having to leave the house: information is available from the
desktop computer.
One particular area of the WWW that is considered relatively new is the
field of the digital library. For the first time in history it is
possible to build large-scale services where collections of information
are stored in digital formats and retrieved over networks. The digital
library is bringing together facets of many disciplines, and experts
from different backgrounds with different approaches.
This is a fundamental period in the history of libraries and
publishing, as technology is advancing to the point where in a
completely digital library, nothing ever need reach paper. In other
words, the digital library brings the library to the user. Since the
1980s, scientists and librarians have predicted the use of networks to
disseminate information. Two factors that will greatly influence the
future of the concept are the rate at which these collections become
available on the web and the economic balance that emerges, with some
collections offering open access and others paid for directly by their
users. It is not yet certain what this balance will be.
Despite the fact that digital libraries are only really coming to the
forefront now, with the explosion of the World Wide Web (WWW), the
vision is not a new one. In 1945, an article was published by Vannevar
Bush called "As We May Think". In it, he commented that the methods of
the time for transmitting and reviewing the results of research were
generations old and totally inadequate for their purpose. He went on to
discuss technological advances and provided an outline for one possible
approach to storing information in what he called the Memex. The Memex
design used photography to store information, and for many years
microfilm was the technology perceived as the most suitable for storing
information.
The first serious attempt to store information on computers was in the
late 1960s. The early systems were no threat to the printed document,
as only unformatted text could be displayed, in a fixed-spaced font,
without diagrams, mathematics or the graphic quality that makes reading
easy. Combined with the low resolution and poor contrast of the
monitors then in use, it was not easy to convince people to read from a
screen. It is in the past thirty years that these technological
barriers have been broken down. In the early 1990s, the lowering costs
of computing and the explosion of online services meant we started to
see the development of digital libraries.
One branch of the digital library field on the WWW, known as on-line
archiving, has served as the focus for this project. Online archives
may eventually supersede the original practice of scholars publishing
research in paper journals. In some areas of research, the preliminary
stages involved in writing a paper are already making use of the WWW to
make research findings public.
These archives are still in their infancy, having started to appear
around ten years ago (e.g., Ginsparg's preprint archive database
developed at the Los Alamos National Laboratory, http://arxiv.org/, and
more recently, Harnad's Psycoloquy,
http://www.cogsci.soton.ac.uk/psycoloquy/). Yet, despite the explosion
of the WWW, they are still only being used by a minority of authors
(30-40%), and only in a very few fields (mostly Physics, Mathematics
and Computer Science). Some authors still appear to prefer the
traditional paper journal as the sole method of disseminating their
research.
Before a paper is published in a journal, an early version is submitted
to a journal editor who sends it to experts for review. These peers
then recommend accepting, rejecting or revising the paper for
publication. Papers may be returned to the author for revision; the
submission process is repeated until the paper is published (or
rejected). However, this process is slow, and according to Harnad:
"turnaround times were still slow and uncertain: The requested reprints
were a long time in coming, if they came at all, in the non-first
world; and even in the first, a lot of time and resources were wasted
retrieving reprints; and even the relevant hits had to be committed to
growing, groaning reprint shelves in labs, which eventually created
storage, navigation and retrieval problems of their own."
The on-line archives evolved from the practice of authors emailing
pre-prints of their papers to peers for informal feedback. With the
archives, authors can deposit their pre-refereed work (pre-prints) and
published work (post-prints) into an archive for all to see. Given the
cost and sheer number of refereed journals that exist (over 20,000
according to Bowker), it is simply not possible for libraries and
institutions to subscribe to more than a small portion of all the
papers published every year. For those trying to locate papers, the
process can be a long, laborious and often fruitless one.
According to Steven Bachrach et al:
Electronic communication has created new ways to distribute such
results and is forcing researchers and publishers to reassess the old
procedures and consider new possibilities as we learn to use the
Internet. Now, not only can authors easily disseminate their results,
but networked readers can have cheap, fast access to more scientific
literature and have it in a form that facilitates its use in their own
work.
With on-line archives, all papers can be located by anyone quickly and
easily -- and at no cost. Authors can put draft copies and successive
updates up for public view, until the final, peer-reviewed (published)
version appears. Users can follow the research through all of its
successive stages from pre-prints through to the post-prints.
In both the pre-refereeing pre-prints and the final published
post-prints, the author cites other authors' work that has been used in
the research. If a reader wishes to look at a cited piece of work, that
paper has to be found elsewhere. This is time-consuming, and often ends
with the paper being inaccessible, because the user's institution
cannot afford to subscribe to it. On-line archives are developing ways
of linking papers to all the papers they cite (OpCit Project). The user
only needs to click on the citation -- as long as it too is archived.
The Los Alamos Eprint Archive (arXiv) was set up as an online archive
on the WWW, to house research literature in Physics. The first of its
kind, arXiv started ten years ago in 1991 and has been expanding ever
since. It has over 130,000 papers deposited to date.
Cogprints was set up in 1997 to archive papers in the areas of
Psychology, Neuroscience, Linguistics, Computer Science, Philosophy
and Biology. It has approximately 1000 papers deposited to date.
The Open Citation-linking Project (OpCit) is already investigating the
practices of arXiv users and authors. It is currently developing tools
to make the existing resources more powerful by completely citation
inter-linking all of the papers in arXiv and eventually to extend this
to all the rest of the disciplines in other open archives. For the
purposes of the third year project this investigation has been extended
to investigate a previously unexplored avenue. A fundamental part of
the usage of an on-line archive is the habits of the users themselves.
The aim has been to extend the ongoing research by investigating the
relation between the objective online indicators and the authors' own
verbal reports of their practices and rationale in archiving their
work.
Particular questions that were asked in the early stages of the project
were:
- Why do authors use the archive?
- Do some areas of Physics archive or use the archive more than
others?
- At what point does an author decide to archive a draft?
- Do authors cite pre-prints, published post-prints, or both, and under
what circumstances?
- When an author cites a pre-print, do they update the citation when
the cited paper is revised or published?
- If a paper is eventually accepted by a journal, does the author
update the text of the paper or just the reference information?
- What are the authors' practices in archiving successive drafts?
- What are the authors' practices in citing successive drafts?
- What is the relationship between an author's impact factor, download
frequency and other online performance indicators?
Speculative reasons were presented as to why physicists use on-line
archives more than other fields of research. The reasons are as
follows:
1. Physicists have a much stronger pre-print culture; they had been
mailing or emailing copies of their work to one another well before
arXiv began, and arXiv was simply a more efficient way of continuing
this practice.
2. Physicists work within a TeX culture. They all write their documents
in this format, making it easier to circulate information. Physicists
may rely more on each other's work, whereas in Computer Science, fewer
papers are written and development in the field does not depend so
closely on earlier papers in the subject area.
3. Physicists are more serious about their research, and the speed at
which they carry it out is important, making archiving the obvious
practice to use.
4. The difference between un-refereed pre-prints and refereed
post-prints is minimal.
Indeed, the question of why authors do not use archives also needs to
be addressed. The issue of copyright in papers belonging to the journal
may be one reason why not everyone has embraced the idea of on-line
archives. According to Steven Bachrach et al:
It is expected that some journal publishers will feel threatened by so
fundamental a change in ownership practice. The most important concern
for publishers and authors alike is that the Internet enables anyone to
create new electronic publishing means. Such new distribution outlets
may well overtake traditional publishing institutions, particularly
when those institutions fail to keep up with the evolving needs of a
Perhaps it is the traditional journals and their policies that stop
authors experimenting with new technology? Whether it has anything to
do with the journals' policies or not, it seems to be the view of some
that the journals will likely disappear within 10 to 20 years, and
that publication delays will disappear, while the reliability of the
literature will increase, with opportunities to add comments to papers
and attach references to later works that cite them. The use of
electronic forms of scholarly information has typically been growing at
50 to 100 percent per year, so it will be interesting to see if these
factors are reflected in the questionnaire responses.
This empirical research has been carried out with the use of
questionnaires submitted to the users of on-line archives, to return
measures of the views of both the authors and the users of the
archives.
Original Research Carried out for the OpCit Project
The Open Citation Linking Project (OpCit) is a funded project being
carried out within the university, currently developing tools to make
the existing resources more powerful by completely citation
inter-linking all of the papers in The Los Alamos Eprint Archive
(arXiv) and eventually to extend this to all the rest of the
disciplines in other open archives. The project has already seen other
researchers, Tim Brody and Ian Hickman, investigating the
relationships between user practices and on-line indicators. However,
the research focussed on the citation impact of papers and the usage
patterns of arXiv, rather than the patterns of the users themselves.
The first decision that needed to be made was how to go about the
empirical research. The questions that had already been asked in the
project brief leaned towards the extraction of personal views from both
the authors and readers of E-Print archives. The obvious choice, then,
was to design a questionnaire: the easiest method of gathering views,
and one that would produce a regular set of results that could be
compared with relative ease.
To question the maximum number of authors and readers, following
previous discussion of the WWW, the logical solution was to design a
web-based questionnaire to be used on the Internet.
The second decision was who to target with the questionnaires. It was
paramount that they reached a set of people that were both users and
non-users of on-line archives, and that they were authors and/or
readers of academic papers. For instance, it would be of little use to
target someone who does not read or write academic papers.
Each decision is fully documented in the following sections.
The Target Set
The original plan specified sampling authors and readers of academic
papers in on-line archives. However, it was ascertained that this set
of people would not necessarily encompass non-users of on-line
archives. The decision was made to separate the sample set into users
and non-users, both as archivers and as readers. This way, the question
why and why not people use these archives would be answered, and
hopefully give a reason why archives have not yet superseded the idea
of paper journals.
Again, the original plan specified that the set of people targeted
would be the users of arXiv. With the change in decision of who to
target, this would now encompass both users and non-users of arXiv.
However, this would not answer the question of why physicists have a
much stronger pre-print culture than other areas of research. Thus, to
make the results much broader and more impartial the decision was made
to also direct the questionnaires at the users and non-users of
Cogprints. (Cogprints was set up in 1997 to archive papers in the areas
of Psychology, Neuroscience, Linguistics, Computer Science,
Philosophy and Biology. It has approximately 1000 papers deposited to
date.)
With the inclusion of a new set of people, this would now also answer
the question of why the researchers from a Computer Science background
are slower than Physicists to embrace this new technology even though
they have had a more direct hand in developing it.
With four subsets of people to question (arXiv users, arXiv non-users,
Cogprints users and Cogprints non-users), the overall results would be
broadened by comparing the habits of researchers in different fields.
How to Reach the Target Set
The next task was to make sure that people from the target sets would
somehow have access to the relevant questionnaire. A number of avenues
were explored to find the most appropriate method of getting them to
see the questionnaires.
Research that was carried out previously for the OpCit project involved
gaining permission from arXiv to target one hundred and thirty four
authors who used the archive to deposit papers, using their email
addresses. This resulted in thirty-four responses: a 25% response rate.
This was only a small subset of arXiv authors. A larger population was
required to get a representative sample of archiving and usage
practices. As the new survey required a greater number of responses,
and the non-users of archives could not be located so easily, a
decision was made to put a link button on as many websites as possible
that authors and readers of academic papers might visit. (A link is an
area on a web page that, if clicked, takes the user to another
specified web page. In this case, the user was alerted to the fact
there was a survey taking place that might be of interest to them.
Clicking on the relevant link took them to a page offering a choice of
the four questionnaires.)
The starting point was to request a link on the arXiv main web site in
Los Alamos, followed by links on the mirror sites around the world.
Then a link was requested on the Cogprints web site. Both were
successfully granted. Thereafter, links were requested and placed on
the following web sites:
It should be noted that all of the web sites are areas that both
readers and authors of academic papers are highly likely to use.
The questionnaires were deposited in the author's university file
space, allowing access to anyone with a link to them.
There were four main considerations in planning the questionnaires:
- Questions should generate valid and reliable information on the
matter being surveyed.
- Respondents must find the questions comprehensible.
- Questions likely to produce biased answers should be avoided.
- The questionnaire should be designed with the subsequent data
analysis in mind.
It was accordingly decided that (i) the questions should be short and
unambiguous, (ii) they should fall into a logical sequence and (iii)
leading questions should be avoided.
The wording of the questions was an important element in the design
process. The above points were taken into account, and the questions
were written and rewritten on many occasions to be completely unbiased
and to offer options that the user could choose to cover all
eventualities. The users of the questionnaires were offered the chance
to give hand-typed comments throughout the questionnaire to make their
opinions absolutely clear, thus avoiding unintentional bias.
As there were four different sets being sampled, four sets of questions
needed to be designed. It was decided that there should be two main
questionnaires designed initially: one for users of the archives, and
one for non-users of archives. In order to make the comparison of
results easy and impartial, the questionnaires needed to be designed in
parallel. The questions were designed so that the target sets were
asked the same questions, except in circumstances where a question was
completely irrelevant. In the case of questions not running in
parallel, another relevant question was asked. For example, questions
1-5 are the same in all questionnaires; however, in question 6 a
different question is used for users and non-users, and then questions
7-14 are the same in all of the questionnaires.
The obvious choice was to run all questions in parallel and then
include the odd questions at the end. However, the decision was made to
keep the questions in a logical sequence, making the questionnaires as
easy to use as possible.
To make the questionnaires complete, text was composed to introduce the
user to why the survey was being carried out, and relevant contact
information. Each user of the questionnaires had an option to say who
they were, and offer further comments in connection with the survey.
The two master sets of questions (users and non-users) can be found in
appendix 2. The four final individual questionnaires can be visited at:
Alternatively, the final questionnaires can be found on the floppy disk
marked.. at the back of this report.
The Questionnaires Technical Aspects
Once decisions had been made about the questions to be asked, and in
which way they would be transcribed, how to actually carry out the task
was the next problem.
The questionnaires were to be web-based. Two options of how to set them
up were considered: HyperText Markup Language (HTML) and Microsoft
Access (MS Access). MS Access offers functionality, including wizards,
to automatically design web form templates using a database set up by
the user. Having little knowledge of MS Access, the decision was made
to avoid this option, as time was in short supply.
Having used HTML previously, it seemed the natural solution for setting
up the questionnaires. It is an easy-to-learn mark-up language that
offers complete control and freedom when considering the aesthetics of
the web pages and can be linked to scripts that will process the
results from web forms. For these reasons, HTML was the chosen format
for the questionnaires.
Following consultation with Tim Brody, who had already designed a basic
questionnaire for research with the OpCit project, the design provided
a starting point for learning how to create the four new
questionnaires. Basic techniques for creating web-based questionnaires
were explored, including how to create form fills, radio buttons and
drop-down menus. Two books on HTML were used to master these
techniques.
The four final versions of the questionnaires made use of option boxes
that users could scroll through if there were a large number of
choices, comment boxes that offered the users the chance to type in
personal comments, and drop-down menus that offered a larger choice of
options if the question required it.
The four questionnaires were made uniform, for the reasons discussed
earlier in the report. This facilitated the data processing and
comparisons. At this stage, there were essentially two basic layouts,
one for the users of both archives and the other for the non-users.
The wording of the titles and explanations differed according to the
target set.
Many revisions of the questionnaires were made to ensure that the
questions were worded correctly and readily understandable.
Figure 1 shows a small amount of the raw HTML used for the
questionnaires, covering question 2.1 ("Do you self-archive your papers
in arXiv?") and question 2.2 ("If yes, since when have you been
self-archiving in arXiv?").
Figure 1. Example of HTML used in the questionnaires
Each question was individually tagged giving it a unique reference so
that when the results were processed, they could be easily identified.
There are a number of options for each question, or an optional text
field. The chosen option value or text is matched to the tag.
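To illustrate, a tagged question of this kind might be marked up as follows. This is a hypothetical sketch only: the field names (q2_1, q2_2) and option values here are illustrative, not the actual tags used in the questionnaires.

```html
<!-- Question 2.1: radio buttons; the name attribute acts as the
     question's unique tag, so every response is keyed to it -->
<p>2.1 Do you self-archive your papers in arXiv?</p>
<input type="radio" name="q2_1" value="yes"> Yes
<input type="radio" name="q2_1" value="no"> No

<!-- Question 2.2: a drop-down menu; on submission the chosen option
     value is matched to the tag q2_2 -->
<p>2.2 If yes, since when have you been self-archiving in arXiv?</p>
<select name="q2_2">
  <option value="1991">1991</option>
  <option value="1992">1992</option>
  <option value="1993">1993</option>
</select>
```

When the form is submitted, the processing script receives each tag paired with the selected value (e.g. q2_1=yes), which is what allows the responses to be identified question by question.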
One particular problem that was encountered was how to create the
drop-down option menus for questions such as question 3.1 that asked
how many papers an author had written in their lifetime. The answer
could conceivably be anything from zero to one thousand. To offer an
option box with one thousand and one choices and corresponding tags
would have taken too much time, so a solution had to be found.
A second problem was how to tag the responses numerically. The original
plan was to use plain decimal numbers. However, after testing this
system, it was discovered that the tags were sorted as text: all tags
starting with 1 were listed first, then those starting with 2,
resulting in 1, 11, 12, 13, ..., 19, 2, 20, 21, 22, ... Obviously, this
resulted in the questions becoming muddled and made the results
confusing. Zero-padding the tags to a fixed width forces the textual
sort into the correct numerical order, which solved the problem. Thus,
the tags look like: 001, 002, 003, ..., 011, 012, 013, ..., 020, 021.
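The sorting behaviour described above can be reproduced in a few lines. The following is a general Python sketch of the problem, not the project's own code (which was written in HTML and PHP):

```python
# Plain decimal tags sort as text, grouped by leading character,
# not in numerical order.
plain_tags = [str(n) for n in range(1, 23)]       # '1', '2', ..., '22'
print(sorted(plain_tags))
# Text order: '1', '10', '11', ..., '19', '2', '20', '21', '22', '3', ...

# Zero-padding every tag to a fixed width makes the textual sort and
# the numerical sort coincide.
padded_tags = ["%03d" % n for n in range(1, 23)]  # '001', '002', ..., '022'
print(sorted(padded_tags))
# Fixed-width order: '001', '002', '003', ..., '022'
```

The fix works because fixed-width strings compare character by character from the left, so leading zeros play the role that place value plays in numbers.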
Following consultation with peers, the suggestion was made to use PHP.
This entailed writing a small piece of code within the HTML that would
dynamically produce all of the options, with tags to match. The PHP
code used is shown in Figure 2.