Re: Central versus institutional self-archiving

From: Stevan Harnad
Date: Wed, 4 Oct 2006 22:22:08 +0100

  Comments on:

      Ginsparg, Paul (2006) As We May Read. The Journal of Neuroscience,
      September 20, 2006, 26(38): 9606-9608

> "[A]rticles are deposited [in Arxiv] by researchers when they
> choose (either before, simultaneous with, or after peer review),
> and the articles are immediately available to researchers throughout
> the world.

Arxiv is a Central Repository (CR) in which physicists (mostly, along with many
mathematicians and some computer scientists) have been self-archiving
their unrefereed preprints and their peer-reviewed postprints since
1991. It is important to keep in mind that researchers self-archive
preprints as well as postprints, because it makes a big difference
whether one extrapolates from Arxiv as a preprint CR or a postprint CR,
as we shall see below.

It is also pertinent to bear in mind that Arxiv is indeed a *Central*
Repository (CR), because there is now a growing movement toward
distributed Institutional Repositories (IRs). The IR movement was
facilitated by the Open Archives Initiative (OAI) Protocol for Metadata
Harvesting, which renders all IRs and CRs interoperable: the OAI Protocol
was in turn created partly as a result of an initiative from Arxiv.

Because all OAI-compliant IRs and CRs share the OAI Protocol, their
metadata can be harvested into search engines that treat all of their
contents as if they were in one big virtual CR.
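
To make the harvesting point concrete, here is a minimal sketch of how a
harvester might merge records from several repositories into one virtual
collection. The request parameters (verb=ListRecords, metadataPrefix=oai_dc)
and the XML namespaces come from the OAI Protocol itself; everything else
(the example.org repository names, identifiers, titles, and the helper
function names) is hypothetical and only for illustration. The responses are
hand-written samples rather than live network calls.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI Protocol and by Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
NS = {"oai": OAI_NS, "dc": DC_NS}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for one repository's OAI endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        records.append((identifier, title))
    return records


def build_virtual_index(responses):
    """Merge responses from many repositories into one flat list."""
    index = []
    for repository, xml_text in responses.items():
        for identifier, title in parse_records(xml_text):
            index.append(
                {"repository": repository, "identifier": identifier, "title": title}
            )
    return index


def sample_response(identifier, title):
    """Return a minimal hand-written response (hypothetical data)."""
    return f"""<OAI-PMH xmlns="{OAI_NS}">
  <ListRecords>
    <record>
      <header><identifier>{identifier}</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="{OAI_DC_NS}" xmlns:dc="{DC_NS}">
          <dc:title>{title}</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


# One hypothetical IR and one hypothetical CR, harvested the same way.
responses = {
    "ir.example.org": sample_response("oai:ir.example.org:1", "A postprint in an IR"),
    "cr.example.org": sample_response("oai:cr.example.org:7", "A preprint in a CR"),
}
index = build_virtual_index(responses)
for entry in index:
    print(entry["repository"], entry["identifier"], entry["title"])
```

The harvester never needs to know whether a record came from an IR or a CR;
both answer the same request in the same format. A real harvester would also
have to fetch the responses over HTTP and follow the protocol's resumption
tokens for long result lists, which this sketch leaves out.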

> As a pure dissemination system, [Arxiv] operates at a factor of
> 100-1000 times lower [1.0% - 0.1%] in cost than a conventionally
> peer-reviewed system (Ginsparg, 2001)."

This is true, but it is tantamount to saying that as a pure dissemination system,
photocopying the articles published in journals operates at a fraction of the
cost of publishing a journal: A fraction, but a parasitic fraction, for without
the journal, there would be nothing to either photocopy or distribute in Arxiv.

Nothing but the unrefereed preprint, that is. And this brings us face
to face with the fundamental question: What are the true costs of peer
review, and peer review alone? The peers (scarce, overused resource though
they are) review for free, so it is not their services whose costs we are
talking about, but the cost of implementing the peer review: processing
the submissions, picking the referees, processing their reports, deciding
what revisions need to be done to meet the journal's quality standards
for acceptance, and deciding -- perhaps again by consulting the referees
-- whether those revisions have been successfully done. The selection of
referees and the decision as to what needs to be done is usually made by a
qualified, answerable super-peer: the editor (or a board of editors). The
editors' services, and the clerical services for processing submissions,
communicating with referees, and processing referee reports are the costs
involved -- and these include not just accepted papers, but rejected
ones too (with some journals' rejection rates being over 90%).

In other words, peer-reviewed journal publishing is not a "pure
dissemination system." Implementing the peer review costs some money
too. There are estimates of what it costs ($500 per paper was the average
estimate a few years ago, which is between one-third and one-sixth of
the charge per article that today's "Open Choice" journals are currently
proposing -- although a few journals with high rejection rates have
suggested a figure of $10,000 per article, without making it clear whether
this represents their costs per article or their income per article).

The annual cost per paper in Arxiv, to Arxiv, was estimated a few years ago at
about $10, so this is indeed somewhere between 2% of the low-end estimate ($500)
and 0.1% of the high-end estimate ($10,000). If we include the cost to the
depositor of keying in the deposit, it's a few pennies more.
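
The arithmetic behind those two percentages can be checked in a few lines.
This is only a sketch using the three figures quoted above ($10, $500 and
$10,000); the variable and function names are mine, and the figures
themselves are rough estimates, not measured costs.

```python
# Figures quoted in the text (all in US dollars).
ARXIV_COST_PER_PAPER = 10      # annual cost per paper in Arxiv
PEER_REVIEW_LOW = 500          # average per-paper estimate for peer review
PEER_REVIEW_HIGH = 10_000      # figure suggested by a few high-rejection journals


def percent_of(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


low_share = percent_of(ARXIV_COST_PER_PAPER, PEER_REVIEW_LOW)
high_share = percent_of(ARXIV_COST_PER_PAPER, PEER_REVIEW_HIGH)

# The same comparison expressed as "how many times lower".
low_factor = PEER_REVIEW_LOW / ARXIV_COST_PER_PAPER
high_factor = PEER_REVIEW_HIGH / ARXIV_COST_PER_PAPER

print(f"Arxiv cost is {low_share:g}% of the low-end estimate ({low_factor:g}x lower)")
print(f"Arxiv cost is {high_share:g}% of the high-end figure ({high_factor:g}x lower)")
```

Note that the $500 estimate gives a factor of 50, somewhat below the
100-1000 range in the quoted passage, while the $10,000 figure gives a
factor of 1000, at the top of that range.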

But what do these figures mean? Why compare the cost of online
dissemination alone with the cost of peer review (or any of the other values a
journal adds, such as the print edition, copy-editing, reference-checking,
and mark-up)?

> "with many of the production tasks automatable or off-loadable to
> the authors, the editorial costs will then dominate the costs of an
> unreviewed distribution system by many orders of magnitude."

Translation: Online dissemination of unrefereed preprints alone costs
a lot less than peer-reviewed publication. True, but what follows
from that? Peer-reviewed publication costs a lot more than photo-copying
too, but what authors photocopy and distribute is their peer-reviewed
publications, not just their unrefereed preprints.

> "Although the most recently submitted articles have not yet
> necessarily undergone formal review, the vast majority of the
> articles can, would, or do eventually satisfy editorial requirements
> somewhere.... [Arxiv's moderated] submissions are at least 'of
> refereeable quality'."

Every paper is first an unrefereed preprint -- and then, eventually, most
are revised into peer-reviewed, accepted articles (postprints). Hence if
preprints are deposited in Arxiv at all, it stands to reason that Arxiv's
most recently deposited (sic) papers (sic) have not yet undergone peer
review. Tune in a year later, and they will have been, with the revised
postprint now also deposited.

Preprints and postprints are deposited rather than "submitted" to IRs
or CRs, because an archive is merely a repository, not a certifier of
having met a peer-reviewed journal's quality standards: let's reserve
"submission" for the attempt to meet a journal's peer-review quality
standards. Moreover, unrefereed preprints are merely papers, not articles;
they become articles when they have been accepted for publication by a
peer-reviewed journal. This is not pedantry or formalism. It is merely
the sorting out of what has and has not met known quality control
standards. The tag certifying this is currently the journal name,
with its established quality level and track-record. A peer-reviewed
journal (apart from its function as an access-provider) is a peer-review
service-provider/certifier, answerable for its quality standards with
its own prestige and reputation.

It is not at all clear what an IR's or CR's certification of which of
its deposits is "of refereeable quality" might mean to busy researchers
who need to know whether the paper is worth risking their limited time to
read and try to use, apply and build upon. They currently do this by
seeing whether and where it has been published (with the journal name and
track record serving as their indicator of the article's probable level
of quality, reliability and validity). Unrefereed preprints have always
been something to be handled with care, with only the author's name, institution
and prior track-record as a guide to their reliability. Is Arxiv's tag
of being "of refereeable quality" meant to serve as a further guide? Or
as a substitute for something?

> "[P]roposed modifications of the peer review include a two-tier system
> (for more details, see Ginsparg, 2002), in which, on a first pass,
> only some cursory examination or other pro forma certification is
> given for acceptance into a standard tier. At some later point,
> a much smaller set of articles would be selected for more extensive
> evaluation."

This is a speculative hypothesis. It is no doubt being tested to see
whether it works, whether it delivers results of quality and functionality
comparable to standard peer review, whether it is cost-effective, and
whether it can replace journals. But as it stands, the hypothesis alone
does not tell us whether and how well it will work; Arxiv is certainly
not evidence for the validity of this hypothesis, since virtually all
papers in Arxiv still undergo standard peer review. Arxiv is merely a CR
that provides Open Access (OA) to both the preprints and the postprints.

> "using standard search engines, more than one-third of the high-impact
> journal articles in a sample of biological/medical journals published
> in 2003 were found at nonjournal Web sites (Wren, 2005)."

This is very interesting. This is the higher end of a self-archiving rate
that we have found to range between about 5% and 25% across disciplines. Physics
is of course even higher (mostly because of Arxiv) and computer science
higher still (see Citeseer).

> "at least 75% of the publications listed [in neuroscience] were
> freely available either via direct links from the above Web page or
> via a straightforward Web search for the article title."

This is even more interesting. It means that in such fields the majority
of the articles -- note that we are almost certainly not talking about
unrefereed preprints here but about peer-reviewed postprints -- are being
self-archived already, so the only thing that remains to be done is to
deposit (or harvest) them into the author's own OAI-compliant IR rather
than a random website, to maximise visibility, harvestability, and impact.

> "The enormously powerful sorts of data mining and number crunching
> that are already taken for granted as applied to the open-access
> genomics databases can be applied to the full text"

Indeed. And semantic and scientometric analyses too (though article texts are
not quite the same thing as the research data on which the articles are based,
hence the analogy with the genomics databases may be a bit misleading).

> "it is likely that more research communities will join some form of
> global unified archive system without the current partitioning and
> access restrictions familiar from the paper medium"

What makes it most likely is the self-archiving mandates proposed or already
adopted the world over (e.g. RCUK, Wellcome Trust, FRPAA, EC, plus individual
institutional self-archiving mandates: CERN, Southampton, QUT, Minho).

But the deposits will not be done in one global CR, nor in a CR like
Arxiv for each discipline or combination of disciplines. With the advent
of the OAI Protocol, all IRs and CRs are interoperable, and since the
research institutions themselves are the primary research providers,
with the direct interest in maximising the uptake and usage of their own
research output, the natural place for them to deposit their own output is
in their own IRs. Any central collections can be gathered via OAI harvesting.
Institutions are also best placed to monitor and reward compliance with
self-archiving mandates, both their own institutional mandates and those
of the funders of their institutional research output.

Arxiv has played an important role in getting us where we are, but it is likely
that the era of CRs is coming to a close, and the era of distributed,
interoperable IRs is now coming into its own in an entirely natural way, in
keeping with the distributed nature of the Net/Web itself.

Stevan Harnad
Received on Wed Oct 04 2006 - 23:45:31 BST

