Re: Estimates on data and cost per department for institutional archives?

From: Stevan Harnad
Date: Mon, 8 Dec 2003 22:48:41 +0000

JQ Johnson wrote:

> please note that I have replied only to one of the lists
> you CCed. Subscribers to the other lists may, like me, get multiple
> copies of your posts. If not, they don't have the context of Kan's
> original question.

This entire thread is archived and accessible at:

    "Estimates on data and cost per department for institutional archives?"

> The "total buyin" scenario puts an upper bound on the size of
> the archive. That upper bound is quite small, so the reductio
> argument leads one to conclude that disk space is not an issue for
> [p]reprints. I think that's a conclusion you would agree with.

Agreed. But I am talking about eprints, which means (optionally)
unrefereed preprints and (essentially) refereed postprints.

The 1st, 2nd, and 3rd purpose of the open access movement is to provide
free, full-text online access to all of the articles (2.5 million
annually) published in the planet's 24,000 refereed journals.

That means the refereed version of the article is the essential target;
providing access to that is the essential function of the archive; and
the rest are just bonuses, which must not be allowed to get in the way
of the essentials.
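
For what it is worth, the "upper bound is quite small" point can be
checked with back-of-envelope arithmetic. The sketch below is
illustrative only: the 2.5 million articles and 24,000 journals are the
figures cited above, whereas the 1 MB average file size and the 0.1%
institutional share of world output are assumptions picked purely for
illustration, not measurements.

```python
# Back-of-envelope sizing of an eprint archive.
# Figures cited in the discussion:
ARTICLES_PER_YEAR = 2_500_000  # refereed articles published annually
JOURNALS = 24_000              # refereed journals

# Assumptions for illustration only (not from the discussion):
ASSUMED_MB_PER_ARTICLE = 1.0       # hypothetical average full-text size
ASSUMED_INSTITUTION_SHARE = 0.001  # hypothetical: 0.1% of world output


def storage_gb(articles: int, mb_per_article: float) -> float:
    """Return the storage needed, in GB, for the given number of articles."""
    return articles * mb_per_article / 1024


articles_per_journal = ARTICLES_PER_YEAR / JOURNALS
institution_articles = round(ARTICLES_PER_YEAR * ASSUMED_INSTITUTION_SHARE)
institution_gb = storage_gb(institution_articles, ASSUMED_MB_PER_ARTICLE)
world_gb = storage_gb(ARTICLES_PER_YEAR, ASSUMED_MB_PER_ARTICLE)

print(f"Average articles per journal per year: {articles_per_journal:.0f}")
print(f"Institution's articles per year (assumed share): {institution_articles}")
print(f"Institution's storage per year: {institution_gb:.2f} GB")
print(f"Entire annual literature: {world_gb:.0f} GB")
```

Under those assumed values, one institution's annual refereed output
comes to a few gigabytes; substitute local figures for the two assumed
constants to redo the estimate.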

> You also seem to miss the point that many of us disagree
> with you about the goals/conceptualization of an institutional
> repository. Slightly different conceptualizations from yours, for
> instance the decision that the repository should hold, in addition
> to surrogates for published papers, the canonical copies of working
> papers or supplementary materials, can lead to very big differences
> in one's sizing expectations and budget for the repository.

No, I think I fully understand that point. But it is in the interests of
at last providing access to the essentials -- the peer-reviewed research
literature, which is what the open-access movement is about -- that I
strongly urge those who have other objectives for their institutional
archives to pursue them separately. The point is that these other agendas
should not be allowed to get in the way of the essentials (for the open
access movement), which consist of the "surrogates for published papers"
(as you aptly put it) and not other things.

It is not that other things should not be archived! Nor even that they
should not be archived in the same archive, if that is possible. But the
constraints on the archiving of the other things should not be allowed to
hamper the archiving of the essentials in any way. That includes capacity
and preservation burdens. If the other things have those burdens, they
should be carried separately, not imposed needlessly on the essentials.

> Re preservation: I also disagree with your claim that having a
> library mention preservation is likely to dissuade faculty from
> contributing their work. Quite the contrary, a commitment to a
> reasonable degree of preservation is one way to sell the service,
> and in fact is a piece that we've found to be fairly appealing
> to faculty, even if it does have institutional costs that need to
> be considered.

Anything that actually generates the self-archiving of the essentials
-- reminder: refereed articles! -- is welcome. But the fact is that
there is not yet nearly enough self-archiving of refereed articles.
So if the preservation promises have served as an inducement, they
haven't yet helped much!

Far, far more likely is that preservation promises have simply
compounded the still prevalent misunderstanding about what is to
be self-archived, how, and why. The author of an article in Nature,
for example, has no worries about preservation. If you want to induce
him to self-archive, you have to demonstrate to him the advantages --
in terms of research impact -- of providing open access to his Nature
article. You will not induce him to self-archive by telling him that
if he does, you promise to provide long-term preservation for that
surrogate of the article! He is not concerned about long-term
preservation. (Why should he be any more concerned about the long-term
preservation of his Nature articles today than 10, 20, or 30 years ago?)

But I am always open to new inducements for open-access provision. If --
mirabile dictu -- preservation promises start to serve as the successful
inducement (instead of the red herring they have been till now), I will
immediately stop preaching impact and start preaching preservation! (But
I would only keep preaching at all because in reality open-access *is*
for the sake of impact, even if, for some reason, preservation should
prove to be the only language researchers understand.)

I profoundly doubt, however, that the promise of preservation will do
the trick. It's more likely to distract from the real goal, as it has
done so often in the past:

> Similarly, we find that there is demand for a convenient way to make
> supplementary materials available to colleagues. Many journals these
> days realize there is demand for this and are offering this service
> themselves (so it's ALREADY part of the publication process), but
> not all. Again, offering support for the deposit of these materials
> is something a library can do to make it MORE likely that faculty
> will contribute the paper itself to an institutional repository,
> and hence provide the open access we both agree is important.

I agree that if an author troubles to self-archive the supplementary
materials for his article, he is more likely to self-archive the article
too. But (1) the vast majority of the yearly 2,500,000 articles do not
have any supplementary materials, and (2) for the minority of articles
whose authors want to self-archive supplementary materials (*and* whose
journals cannot archive them), chances are that the authors are already
self-archiving the articles too.

So I don't think this tail can or should wag the dog.

>JQJ> The key issue here may be:
>JQJ> [aside: we believe that if we DON'T collect such unprintable
>JQJ> items we'll never get faculty buy-in for Stevan's laudable
>JQJ> goal of collecting the printable peer-reviewed works]
>SH> But *why* do you feel that?
> We need more data, but anecdotally, our experience has been that
> those faculty who are hot to self-archive their preprints are
> already doing so, and that arXiv and RePEc are meeting most of the
> demand.

That's probably true, though it leaves out an anarchic multi-disciplinary
body of articles self-archived by authors on their own websites (as discovered
by harvesters such as CiteSeer in computer science, already twice as big as
the Physics arXiv several years ago).

But I agree that it is those who are *not* yet self-archiving (i.e.,
nearly 90%) who are the targets. I just doubt that the lure of either
preservation or data-archiving is going to draw that 90% on board!
(Whereas I do think that powerful objective demonstrations of the
impact-enhancing power of open-access provision will: for the researchers,
and, even more important, for their institutions and funders, who
wield the publish-or-perish carrot and stick that already protects
researchers from any natural tendency to just put their papers into
a desk drawer rather than maximizing their impact by publishing them.)

> There are very few examples of institutional repositories
> that are proving successful in collecting the types of items you
> insist on focusing on.

That is true. But there are some, and more than there were until recently.
And we are still working on formulating that offer that researchers
won't be able to refuse!

    "Measuring cumulating research impact loss across fields and time"

> Even MIT's DSpace is getting very little
> true faculty self-archiving, with most of its growth seeming to
> come from departmental grey literature (and soon from OCI course
> materials).

I can only re-echo the sentiments of:

    "EPrints, DSpace or ESpace?"

The reason institutional archives are not getting filled
is that they are heading off in all directions:

    1. (MAN) digital collection management (all kinds of digital content)

    2. (PRES) digital preservation (all kinds of digital content)

    3. (TEACH) online teaching materials

    4. (EPUB) electronic publication (journals and books)

    5. (RES) self-archiving institutional research output (preprints,
    postprints and theses)

Instead of focussing on 5 (RES).

> Meanwhile, though, many faculty express the desire
> to have better tools for managing large data sets as part of the
> publication process.

Meanwhile, as the archives are *not* filling, and open access is *not* being
provided, and cumulating research impact loss continues to grow, there are
many other desires being expressed (and perhaps even being fulfilled).

But they have nothing to do with open access (to the refereed journal
literature)! And those sectors of the archives continue to yawn empty.

> We had a dean's retreat last month, and our
> deanlet for the social sciences made the point that her new hires
> in the social sciences are starting to imitate their physical
> science colleagues and demand more startup money; when asked what
> they say they need the money for, she reported that it is mostly
> for management and archival of their data sets.

Well, that is certainly interesting, and good news for data-archiving in the
social sciences. But what further conclusion is to be drawn from it?

> Similarly, there
> is pressure in the U.S. from the granting agencies to make the
> data that accompanies a submitted paper publicly accessible.

That pressure is very welcome, but it seems absurd for the granting
agencies to pressure for archiving the data without also pressuring
for archiving the papers themselves! (And are you sure this is all for *open-access*
data-archiving, rather than merely data-archiving?)

> And I frequently hear from faculty that they want to increase the
> impact of their work by making available supplementary materials
> that go along with the peer reviewed paper -- data sets, survey
> instruments, supplementary statistical analyses, maps and images,
> etc. There's real unmet demand here.

A useful thing to say to faculty who have expressed the desire to increase
the impact of their peer-reviewed papers by providing open access to
supplementary materials is: Wouldn't it be a good idea to increase the
impact of your peer-reviewed papers by providing open access to your
peer-reviewed papers?

But I am not disagreeing that data-archiving is a splendid idea and
will help hasten article-archiving too! I'm just suggesting that we
shouldn't focus on trying to make the tail wag the dog.

> I think you make a very good point that it is a large problem that
> institutional archiving efforts keep running off in all directions.
> However, although I think there is definitely room for institutional
> strategies that are monomaniacal in their focus, there's also room
> for strategies that take a very different direction. For instance,
> consider MIT's DSpace, or Ohio State's KnowledgeBank. The important
> thing is that any given institution be clear in its goals, and that
> we recognize that the precise statement of those goals will imply
> particular implementation strategies and hardware/technology/budget
> requirements (the question that Min-Yen Kan originally raised).

There are many digital things institutions can and should archive. One
particularly important thing is their refereed research output. The
self-archiving of that particularly important thing is going particularly
slowly (relative to its importance), mainly (I believe) because the
research community has not yet grasped what a strong direct causal
connection there is between research access and research impact. Their
grasp of that connection will not be hastened by mixing it up with the
archiving of all kinds of other digital things -- and especially if
the archiving of those other digital things has further constraints
and liabilities (such as preservation, large-scale data-archiving,
and multimedia) that spill over onto the archiving of refereed research
output, which does not have those further constraints and liabilities,
and needs more momentum rather than dead weight.

Stevan Harnad

