Re: Consultation Draft - Study on Preservation of eprints

From: Stevan Harnad
Date: Tue, 10 Jun 2003 20:15:18 +0100

> On Tue, 10 Jun 2003, Neil BEAGRIE wrote:
> Dear Stevan
> Many thanks for this posting in response to the consultation draft. Some
> thoughts on issues raised:
> 1. No-one doubts the overall importance of expanding the content held
> in institutional repositories. The draft report itself points out
> there are only around 5,000 eprints estimated currently to be held
> in UK institutional repositories. This is clearly a major issue and
> significant cultural change is still needed to change this position.

Dear Neil,

On this we agree completely. The question is, what will help facilitate
that cultural change, and bring it about as soon as possible (as it is
already overdue!).

> 2. The consultation draft is one of a series of reports commissioned
> under the JISC Continuing Access and Digital Preservation Strategy. The
> archiving of subscription e-journals and the issues surrounding
> the preservation of, and ongoing access to, the primary corpus of published
> literature are considered in parallel.

There are in fact not two but three parallel issues there:

(a) the archiving of subscription-based e-journals (part of electronic
collection management, classified as II "MAN", below)

(b) preservation of all journals (mostly now hybrid, with both a paper
and an online edition, classified as III "PRES", below)

(c) ongoing access to all (research) journals (for those whose
institutions do not have subscription access to them, classified as I
"RES", below).

Yes, these are parallel, but they must be faithfully kept distinct,
because the solution for one is definitely not the solution for another.

> 3. Although the majority of preservation effort is clearly needed on the
> published corpus, the consultation draft for eprints is surely right to
> point to the preservation issues which are likely to arise in time for
> institutional repositories.

I am afraid this mixes up the three again. Institutional repositories
have the problem of managing their own online collections of whatever
they might have (MAN). To a great extent, managing these collections
also entails preserving them (PRES). This preservation burden, in the
case of the primary, subscription-based journal literature, is one that
is perhaps shared by university libraries and the publishers of the
journal literature.

(I say "perhaps" because I know that often the publishers are not sharing
the preservation burden, feeling that that is traditionally a library
responsibility, not a publisher responsibility. On these matters I have
no views, except to point out that they have *nothing* to do with the
ongoing-access problem, and should be handled separately, however it is

Self-archived eprints (of which there are so few) do *not* face any
preservation problem for the following reasons:

(i) The long-term preservation problem is currently 100% on the primary
corpus (and whoever is handling that, and how).

(ii) For the short-term (decades at least, as the still-with-us and
still-100%-useable Physics ArXiv and others demonstrate since at least
1991) there is no preservation problem for the secondary, self-archived,
open-access corpus, which merely *duplicates* the primary corpus,
for ongoing-access purposes, for those whose institutions cannot afford
the subscription-based access to the primary versions. The *only*
short-term problem for this back-up corpus is its still minuscule
size! This is the *content* problem (or, better, the *access* problem)
which has nothing to do with preservation issues, and should not
be weighted down with any. It needs facilitation, not further (and
irrelevant) loads.

> 4. Preservation in this context is a means to an end - ensuring continuing
> access to cited research output (published or not).

Preservation of what? With the primary corpus, the answer is clear. But
with the secondary, duplicate corpus, meant only to remedy access
problems (and doing so quite brilliantly for well over a decade now, for
the little that has been self-archived so far) the "continuing access"
problem is rather different, isn't it, from the "continuing access"
problem for those institutions that have and can afford the primary

To put it another way, suppose there were no self-archived versions at
all, and the only access was subscription-access. We would still be
facing all there is of the PRES problem: How to ensure that access to
the subscription-based corpus, for those who have it now, continues to
be had next year, and the next. Fair enough. But nothing whatsoever to
do with the problem of the nonexistent access of those who cannot afford
the subscription access! It is for *them* that the duplicate self-archived
versions are created. And their 1st, 2nd, and 3rd worry is still access to
that corpus *today*, because for most of the 2,000,000 annual articles
appearing in the 20,000 subscription journals today, there *is* no
self-archived version. So "continuing access" is not only moot for this
nonexistent secondary corpus, but it is also beside the point for the
little of it that exists so far. There is no need to fret about
continuing-access to the self-archived corpus: Sort out the preservation
problem for its primary incarnation, and meanwhile let those who are
concerned with secondary access worry about increasing that access. When
all of the contents of the 20K have been self-archived and made openly
accessible, *then* we can see whether there is some way it can help
solve the preservation problem for the primary corpus (if it has not
already been solved by then). Not before. For access, the only problem
is access today.

> 5. The report considers issues which may lead to the retention (or
> withdrawal) of eprints over time. For related material such as e-theses
> mentioned in your (RES) category the need for continuing access and
> preservation will always be present.

First the theses: Here too, there is a primary corpus, with the true
preservation burden: How are/were theses preserved even when no one
self-archived them? And then there is the secondary corpus, for access.
Exactly the same story as above.

Now about self-archived versions *other* than the self-archived final,
refereed, revised, accepted, published journal version. I would say it
is premature to fuss too much about those. The culture of self-archiving
has not yet established itself, whether it be the self-archiving of
unrefereed preprints or refereed postprints. In the scheme of things,
the urgency of getting all 2M annual postprints self-archived and openly
accessible *vastly* outweighs the problem of self-archiving or preserving
the unrefereed prior drafts. Not that there is no point having and
saving preprints, and archiving them for ever. That would be desirable; but
it is completely eclipsed at the moment by the access problem for the
postprints. Moreover, again the preprint corpus is minuscule. So whether
it comes and goes is not the issue.

See the American Scientist Forum thread on
"Eprint versions and removals"
(and its predecessor threads).

The reality there is quite illuminating: We would "love" to be able to
implement and enforce a strict policy that no self-archived draft may
subsequently be removed from an Eprint Archive. It would certainly be
the best for the scholarly record if publicly accessible documents that
users had read, used and cited, didn't vanish thereafter. But the fact is
that one reads, uses and cites unrefereed literature at one's own risk
anyway. It is not only the text that might go up in smoke subsequently,
but its content, if it does not manage to meet the standards of peer

Never mind; the real constraint is this: The reason Eprint Archives
cannot at this time impose a draconian "no removals permitted" policy on
their self-archivers is that the self-archivers are still so few, and
skittish. This is no time to slap their wrists or implant the fear of
god in their heads. It's a time to encourage them to do what none of
them are yet used to doing: self-archive their refereed postprints. If
they also self-archived pre-refereeing preprints, that's well and good,
but don't make it into a handicap, or a deterrent, by staying their
already trembling hands with warnings that "if you hit the entry key on
this draft, it's forever!".

(In reality, it *is* forever, for even if the file is later removed from
the mother-archive, having been openly accessible even for a while, it
may well have been downloaded, harvested, and cached all over the
planet, and it might even have had a probity-based, time-stamped
"snapshot," bit-perfect, stored of it somewhere. But there's no point
even talking about *that* now, when it too could only be seen as
a deterrent before the requisite cultural change has taken place.)

> 6. You note all five aims listed in your email are worthwhile and
> important - but most are not urgent, i.e. needing action now. Although
> preservation is a long-term challenge there are actions which are best
> taken at the earliest possible stage.

For the primary corpus. But not for the frail secondary corpus, where
these measures can only serve as further deterrents and retardants at a
time when facilitators are the only thing needed!

> Institutional approaches to IPR is one and this is being addressed
> elsewhere in the FAIR programme because of its impact across the board
> on repositories.

And I think casting the access problem and self-archiving in anything
faintly resembling "Intellectual Property Rights" terms is just asking
for trouble (and confusion, and still more delay): Self-archiving is
not an IPR issue! The sole IP question -- to which we know the answer --
is whether it is ok for the author of a peer-reviewed journal article
to self-archive his own final, refereed ("vanilla") draft. The short
answer is "Yes" (and that's really all there is to it: it was certainly
enough for the authors of the 250,000 articles self-archived by the
physicists lo these dozen-odd years, and for many times more authors in
other disciplines who have been doing it on their own websites for
at least as long).

But if would-be self-archivers want more details, there's the other
JISC project, Romeo:
and the self-archiving FAQ:

But the less that authors get involved in digital IPR matters, the
better. Their time is far better spent self-archiving!

> Another is the
> issue of capturing the technical metadata which can support long-term
> management.

Again: Management of what? For the secondary, self-archived, open-access
versions of the refereed corpus, the minimal OAI-protocol is already
more than satisfactory. (For the primary corpus, nolo contendere.)

To mix the two is just to blur the picture, a picture that urgently
wants focusing, not blurring!

> As the report points out it is possible most if not all of
> this can be automated at deposit and it recommends exploring interfaces
> between institutional repository software (eprints, DSpace etc) and file
> format recognition software.

All fine, as long as the secondary ongoing-access problem, RES, is not
conflated with any of the other four, MAN, PRES, TEACH and EPUB. It
is special, different, and very urgent.

> Overall I think the consultation draft is carefully balanced. It is
> not attempting to say there are preservation problems to solve before
> repositories can be filled. It is pointing out the role that preservation
> will play as these repositories grow and the steps that can be taken
> to address issues which become far more problematic over time if not
> addressed at an early stage. It clearly recognises the overall importance
> of growing content in repositories.

The more I think of it, though, it is not at all clear that it is
beneficial to see the kinds of repositories we need for RES as having
anything at all to do with the kind we need for MAN, PRES, TEACH and
EPUB. Is it even such a good idea to treat them all as one repository? One
of the lessons we have learned, and the powers we have gained, from
the OAI-protocol is that we need not think of big central repositories
(analogous to libraries) at all any more. OAI-interoperability makes it
much more sensible to think in terms of small, distributed archives,
unified only by the glue of interoperability. Some kinds of archives
may require much finer-grained metadata. Fine. The eprint archives
consisting of secondary self-archived versions of the primary refereed
journal corpus do not need such fine-grained metadata at this time.
Nothing much more than author, title, journal, year (plus a few more, as
in the OAI-protocol) are good enough to serve the immediate and pressing
access needs we have in this area (in face of the content that we lack).

So let the Eprint archives be coarsely interoperable, via the
OAI-protocol, with all the other archives, including many more at the same
university. No need to try to force them into the same Procrustean
meta-bed! Let the fine-grained metadata be worked out for PRES, MAN,
EPUB (and perhaps TEACH). They have the time. But (back-door, vanilla)
*ongoing-access* (to the peer-reviewed corpus) is urgently needed, now.
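
To make concrete how little metadata is being asked for here, the
following is a minimal sketch, not a definitive implementation. It
builds one coarse record in the unqualified Dublin Core format (oai_dc)
that the OAI-protocol requires every archive to support. The mapping of
author, title, journal and year onto the Dublin Core elements creator,
title, source and date is an assumption made for illustration, as are
the function name and the sample record.

```python
import xml.etree.ElementTree as ET

# Namespaces of the oai_dc (unqualified Dublin Core) metadata format.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Assumed mapping from the four coarse fields named in the text
# to Dublin Core elements (an illustration, not a prescribed mapping).
FIELD_TO_DC = {
    "author": "creator",
    "title": "title",
    "journal": "source",
    "year": "date",
}


def coarse_record(fields):
    """Build a minimal oai_dc-style XML element from coarse metadata.

    `fields` maps a coarse field name to a value or a list of values.
    Fields that are absent are simply skipped.
    """
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, dc_element in FIELD_TO_DC.items():
        value = fields.get(name)
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{dc_element}")
            child.text = str(item)
    return root


# Hypothetical sample record; every value is a placeholder.
example = {
    "author": "A. Author",
    "title": "An Example Refereed Article",
    "journal": "Journal of Examples",
    "year": 2003,
}

record = coarse_record(example)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Four fields per postprint are all the sketch needs. Finer-grained
preservation or collection-management metadata could be layered on in
other archives without touching a record like this one.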

> I hope it is a report which will be widely read by emerging institutional
> repositories and look forward to comments from colleagues in the FAIR
> programme in due course.

I hope it will advance the finer-grained and less-pressing needs of II-V
without retarding the coarser-grained and much more pressing needs of I!


> *********************************************************************
> Neil Beagrie JISC Digital Preservation Focus
> Programme Director Secretary, Digital Preservation Coalition
> JISC London Office, Tel/Fax/Voicemail :+44 (0)709 2048179
> King's College London email:
> Strand Bridge House url:
> 138 - 142, The Strand,
> London WC2R 1HH
> ************************************************************************
> -----Original Message-----
> From: Stevan Harnad [mailto:harnad_at_ECS.SOTON.AC.UK]
> Sent: Fri 06/06/2003 20:23
> Cc:
> Subject: Re: Consultation Draft - Study on Preservation of eprints
> The institutional eprint repository movement would benefit greatly
> from clearly separating the 5 quasi-independent aims that currently
> constitute its very mixed agenda. All 5 aims are worthwhile and important,
> but only the first is urgent, and it is the heart of the challenge for
> filling institutional repositories with university research output for
> the sake of maximizing its impact by maximizing access to it:
> The 5 distinct aims for institutional repositories
> I. (RES) self-archiving institutional research output (preprints,
> postprints and theses)
> II. (MAN) digital collection management (all kinds of digital content)
> III. (PRES) digital preservation (all kinds of digital content)
> IV. (TEACH) online teaching materials
> V. (EPUB) electronic publication (journals and books)
> As long as we keep blurring or mixing these 5 distinct aims, the first
> and by far the most pressing of them, RES -- the filling of university eprint
> archives with all university research output, pre- and post-peer-review,
> in order to maximize its impact through open access -- will be needlessly
> delayed (and so will any eventual relief from the university serials
> budget crisis).
> Perhaps the two most counterproductive of the conflations among these
> five distinct aims have been that between I and III (research
> self-archiving, RES, and digital preservation, PRES) and that between
> I and V (research self-archiving, RES, and electronic publication,
> EPUB).
> The RES/PRES mix-up, much discussed in the American Scientist Forum,
> can easily be seen to be a needless and misleading conflation once we
> recall that insofar as the peer-reviewed research literature is
> concerned, the current preservation burden is on its primary corpus,
> which is the published literature (online and on paper). The much-needed
> filling of university research-output archives is a *supplement* to this
> primary corpus, for the purpose of maximizing its impact by maximizing
> access to it; it is not a *substitute* for it. It is simply a mistake
> and a needless retardant on the filling of the university research output
> archives to imply that there are preservation problems to solve before
> they can be filled.
> The RES/EPUB mix-up is really two mixups. The first is the conflation of
> self-archiving with self-publishing: The urgent archive-filling challenge,
> RES, concerns the self-archiving of peer-reviewed, *published* research
> output. Again, this is a *supplement* to publication, for the purpose of
> maximizing its impact by maximizing access to it; it is not a *substitute*
> for it.
> The second RES/EPUB mix-up has to do with university e-publishing
> ambitions (perhaps along the lines of High-Wire Press-Hopes!). It is
> fine to have these ambitions, but they should not be conflated in any
> way with the completely independent and urgent aim of self-archiving
> the university's peer-reviewed, *published* research output.
> Most of this is discussed in the thread:
> "EPrints, DSpace or ESpace?"
> See: "Enhance UK research impact and assessment by making the RAE webmetric"
> Stevan Harnad
Received on Tue Jun 10 2003 - 20:15:18 BST