Re: Estimates on data and cost per department for institutional archives?

From: Stevan Harnad
Date: Mon, 8 Dec 2003 17:06:52 +0000

On Mon, 8 Dec 2003, JQ Johnson wrote:

> At most institutions the average annual peer-reviewed production [per
> author is]... minuscule, even if you are successful in getting widespread
> adoption by your faculty...
> Let's say you got total buy-in...
> [A]lthough the cost of storage and server is minimal, the cost of archival
> is potentially very large.
> If you agree with Stevan you don't care much about long-term access.
> We don't agree...

There is a deep misunderstanding here, and its elements are all there in the
foregoing paragraph. The critical phrase is:

    "if you are successful in getting widespread adoption by your
    faculty... Let's say you got total buy-in..."

This assumption is the logical equivalent of: "Let's say you win the
lottery..." [and then going on to describe how you will spend the
winnings].

There is also an element of embalming Peter in order to preserve Paul!

(1) Today, as we speak, the canonical digital texts of the articles whose
preservation we are discussing are being sold by publishers and bought
by libraries, and are inaccessible to most would-be users because of
access-toll barriers.

(2) The *contents* of these texts are the ones to which we seek to
provide free access for those researchers whose institutions cannot
afford the access-tolls.

(3) So we (the authors and their institutions) provide a *supplementary*
text, a home-brewed version of the canonical one that is still being
bought and sold by the publishers and libraries, and we self-archive
it in our institutional archives.

(4) Or rather, we *could* self-archive it, and we *should* self-archive
it, but alas most of us still do not self-archive it.

(5) In other words, we are still far from having met the "widespread
adoption/total buy-in" precondition that is blithely being assumed in
the above passage!

(6) More important, one of the reasons we have not met those preconditions
is that faculty think they have enough burdens already; and libraries
certainly have enough burdens too.

(7) So it looks quite unlikely that what will get faculty to just go
ahead and do it will be to add or even mention any further burdens
(viz. the putative "preservation" burden).

(8) What we need to do is lighten up and focus on the task.

(9) And that task is getting more faculty (*all* faculty) to provide
open-access *now* (by self-archiving a supplementary version of each
of their peer-reviewed publications in their institutional archives).

(10) Are we sloughing off a substantive responsibility in focusing
solely on immediate open-access provision? Is the long-term fate of these
(mostly non-existent) supplementary versions a real issue?

(11) No, the real issue is not the long-term fate of non-existent
supplements to the canonical versions (the real ones with the real
preservation burden); the real issue is the non-existence of the immediate
open access to all peer-reviewed publications that those supplements
would have provided.

(12) That is, the access they would have provided if their self-archiving
were not being held back by (among other things) spurious preservation
worries!

(13) Spurious not just because self-archived supplements are merely the
copies and not the originals -- those self-same originals that would
have their (real) preservation burdens regardless of whether or not
the non-existent supplements about whose preservation we are worrying
ever come into existence.

(14) But spurious also because what little supplementary self-archiving
has been done across the past 12 years (e.g., the 250,000 papers growing
in the Physics ArXiv since 1991) is all alive and well and fully as
openly-accessible today as it was when it was first self-archived 12
years ago, thank you very much!

(15) So it does not look as if even the immediate open-access windfall,
supplementing the toll-access canon (with its preservation burden), is
at any imminent risk of being snatched back from us, in those few cases
where we have actually troubled to provide the open access.

(16) What will follow from successfully winning the lottery is a matter for
pure speculation.

(17) But should it turn out that once the supplementary corpus -- and hence
open access to the entire refereed literature -- is complete, it forces the
providers of the toll-access corpus to cut costs and downsize in such a way that
they are no longer providing the digital texts of the primary corpus (but only the
peer-review and certification), offloading it instead, from that day onward,
onto the network of institutional open-access archives,

(18) Starting then, and only then, on that happy day, do we need to
begin worrying about the long-term preservation of what until then was merely
a supplementary corpus, but, from that day forward, also takes over the
burden of being the primary, canonical corpus.

(19) Before that, though, we need to win the lottery.

(20) And the lottery will not be won by fretting or fantasizing about
how you will either preserve or spend your non-existent winnings.

Of course, the library community is in the profession of assuming
responsibility for the preservation of contents, whether analog or
digital. And their perennity horizons are far, far vaster than just
the annual 2.5 million articles that appear in the planet's 24,000
journals. But it is important that they not let their ex officio
responsibility for preservation Writ Large get in the way of the winning
of the lottery here: that they not subsume the non-existent preservation
burden for this non-existent supplementary open-access corpus under their
preservation efforts on behalf of many other kinds of contents, including:

> [The] natural extension... to collect[ing] supporting materials
> for... preprints [such as] multi-TB dataset[s] in some fields such
> as astronomy or biology. It only takes one such large dataset to
> completely blow away any space calculations based only on collecting
> the paper-publishable text.

May I make a suggestion? Reckon the space separately, and don't burden the empty
open-access-article shelves with the altogether distinct preservation and capacity
demands of supporting datasets:

    "Refereed Research Archiving and Data Archiving"

Or, to put it another way, don't make the purchase of an open-access lottery
ticket contingent on buying into a data-archiving lottery ticket too!

> Even if you are collecting just
> preprints and theses, the size estimates depend on how you are handling
> acquisition of multimedia materials; if you collect theses in dance you
> might have videos of performances, each of which is several GB.

Preprints and theses are extras, freebies. Our primary, secondary, and
tertiary target here is the peer-reviewed journal literature, 24,000
journals-full, 2,500,000 articles annually. A few new journals may be
multimedia. Don't worry about them. Keep the goal in focus: It's those
24,000 journals, almost all of which have both a paper edition and an
online version of it, identical except for the medium. Don't handicap
the self-archiving of the 2,499,000 yearly non-multimedia articles that
are also non-open-access with the newfound extras of the few yearly
multimedia articles in peer-reviewed journals.

> [aside: we believe that if we DON'T
> collect such unprintable items we'll never get faculty buy-in for
> Stevan's laudable goal of collecting the printable peer-reviewed works]

But *why* do you feel that? Is there any evidence for a correlation between
making an archive capable of storing unprintable items and successfully opening
the sluice-gates of self-archiving?

I can only echo the sentiments of:

    "Re: EPrints, DSpace or ESpace?"

As long as our institutional archiving efforts keep running off in all
directions, they will get nowhere. A fence should be built around our
focussed open-access provision efforts for our refereed article output,
separating them from all spurious distractions and burdens.

Stevan Harnad

