Re: How to compare research impact of toll- vs. open-access research

From: Stevan Harnad
Date: Mon, 8 Sep 2003 20:59:15 +0100

This is a reply to my friend David Goodman. The library community is a
great ally to the research community in having been the first to sound
the alarm about access limitations on the refereed journal literature
(because of the library's growing budgetary problems in serials
acquisition). That first wake-up call will perdure to the library
community's lasting credit when the history of the open-access movement
is written.

However, it will also have to be noted that, having sounded the alarum
on the basis of serials budgetary problems, the library community
sometimes failed to realize the much larger and more important problem
that their own hardships had uncovered: the problem of research impact --
the uptake, usage, and application of research, by and for researchers,
as well as for the research institutions, research funding agencies and
the tax-payers who fund them.

The library community still tends to think of "usage" data in terms
of comparing journals, and how this relates to and justifies serials
expenditures, rather than as a part of the far more general phenomenon
of research impact: Open access is vastly more important than this,
and its importance lies elsewhere.

The relevant comparisons that need to be made there are not concerned
with an institution's *own* use of its *own* bought-in resources (the
library perspective), nor about which journals are being used more by
other institutions, to decide what to buy and what to cancel (again the
library perspective). The comparisons concern the *world's* use of the
institution's own research output.

David makes the valid point, however, that because the Elsevier/BMC
comparisons do not equate subject-matter or journal-quality (and possibly
also not publication date), they could be misleading, and I agree.

Now the commentary on David's posting:

> On Sat, 6 Sep 2003, David Goodman wrote:
> Citation data, which is all that is now available, can indeed be expected
> to roughly correlate with direct journal use data.

The real underlying interest here is *not* concerned with (incoming)
*journal*-use data, but with (outgoing) *article*-use data. It simply
happens that the estimate of and comparison between average open-access
article-use and average toll-access article-use is here being made on the
basis of average open-access journal and toll-access journal statistics.

Now citations would certainly have to have *some* correlation with
usage: Access, after all, is a necessary condition for citation (one
hopes!). But for that, every citation would only have to be matched by
at least one download. Since download figures are far bigger than
citation figures, that would have left room for a correlation
coefficient that was positive, yet near zero.

But that is not what we find empirically. The correlations are quite big,
and range from .3 to .6 or higher, and seem to vary somewhat with field and
subfield. I don't think we would necessarily have expected that, a priori.
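
The correlation at issue is the ordinary Pearson coefficient between
per-article download counts and per-article citation counts. A minimal
sketch follows; the counts are invented for illustration only (they are
not the ArXiv, BMC or Elsevier figures), and the function name `pearson`
is likewise just a placeholder:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-article counts, invented for illustration only.
# Each article has at least as many downloads as citations, which is all
# that "access is a necessary condition for citation" strictly requires.
downloads = [120, 45, 300, 80, 15, 210]
citations = [4, 1, 9, 0, 0, 3]

r = pearson(downloads, citations)
print(f"download/citation correlation: {r:.2f}")
```

The necessary-condition constraint alone fixes only the sign of the
relationship, not its size; how large r turns out to be is an empirical
matter.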

> Some commercial
> publishers decry the validity of citation data, but journal use data will
> not be available until publishers are willing to disclose them for each
> of their titles--to the best of my knowledge, no commercial publisher is.

According to Peter Suber's posting, the BioMed Central and Elsevier
download data were provided by the publishers (and they are both commercial
publishers).

But that is neither here nor there. The trend that Peter inferred from
the data concerned the (average) open-access vs. toll-access *article*
(it was not about relative journal impact factors, particularly).

I then went on to point out -- based on the usage/citation
correlations we found and reported for ArXiv, using Tim Brody's
usage/citation correlator -- that the apparent 89-fold
download-advantage of the average open-access journal article
compared to the average toll-access journal article could be used to
estimate the resultant citation impact too.

I also noted (as did Peter) that the download data are probably
incomplete -- but that the incompleteness actually reflected an
underestimate of the download advantage for the open-access journals
(PubMedCentral, not being, alas, OAI-compliant, was left out). It is
possible that Elsevier too underestimated its own download usage, but
certainly not because of a wish to conceal it, as it was Elsevier that
was freely revealing its download statistics here.

Now it may be true that publishers don't disclose their full usage
statistics to libraries, for fear that the figures might be
unflattering, relative to other journals (and might encourage
cancellations). But it should be clear that nothing of that sort
was at play here.

> However, the gross comparison between publishers using either citations
> or use provides information of limited applicability.

There was no interest whatsoever in comparing *publishers* here! What
was being compared was open-access vs. toll-access impact: a comparison
made in the past by Steve Lawrence, for computer science, using papers
that were and were not self-archived, and now extended to papers that
appeared in open-access vs. toll-access journals (as estimated from BMC
and Elsevier data, respectively).

Neither the motivation nor the message had anything much to do with the
institutional library's usual interest in comparing usage statistics
between journals!

> The average citation rate of articles in different subject fields is
> higher in biomedicine than in many other subjects, and so should be the
> journal use.

This is a good point, and the comparison should be redone using only the
biomedical subset of the Elsevier data, vs. the BMC data (all
biomedical). Jan (Velterop) or Peter: is there any way you can answer this?

> The average citation rate of recently published articles
> is considerably higher than older ones, and so should be the use.

I think this may be irrelevant, as the time-bases for the two estimates
were the same. (Jan? Peter?)

> Based on citation data, I would expect that the factor would be
> considerably less, but still impressive, if limited to the fields and
> years in which BMC publishes. There are very highly cited journals from
> all types of publishers in every scientific field, which indicates that
> the quality of the individual journal can overcome whatever handicap
> may be caused by the publisher.

Yes, journals differ systematically, by field as well as in quality,
and both usage and citation counts will no doubt co-vary with those
differences. The bulk comparison between Elsevier's usage stats
and BMC's usage stats might be misleading, because the average BMC
journal might be in a higher-use field, or its average quality might be
higher. We are actually looking at the effects of two variables at once:
journal quality (hi/lo) and access (open/toll).

Although it is possible to equate journals for subject matter, it
is hard to equate them for quality (and obviously journal citation
impact itself cannot be used to equate them, otherwise it obviates the
open-access/toll-access comparison!). Hence I too hope that our own
study -- comparing citation counts *within* the same (toll-access)
journal and year, between articles that are and are not self-archived
by their authors -- will be able to provide less noisy estimates of the
impact advantage provided by open access.
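
The within-journal comparison can be sketched roughly as follows. This
is only an illustration of the design, not the study's actual method:
the record layout, the journal names, the citation counts and the
function name `within_journal_ratios` are all hypothetical. Articles are
grouped by journal and year, so that field and journal quality are held
constant inside each group, and the mean citation count of the
self-archived articles is divided by that of the others:

```python
from collections import defaultdict


def within_journal_ratios(articles):
    """Per (journal, year): mean citations of self-archived articles
    divided by mean citations of non-self-archived articles."""
    groups = defaultdict(lambda: {True: [], False: []})
    for a in articles:
        key = (a["journal"], a["year"])
        groups[key][a["self_archived"]].append(a["citations"])

    ratios = {}
    for key, g in groups.items():
        archived, not_archived = g[True], g[False]
        if not archived or not not_archived:
            continue  # no comparison possible within this journal/year
        mean_not = sum(not_archived) / len(not_archived)
        if mean_not == 0:
            continue  # ratio undefined when the baseline mean is zero
        ratios[key] = (sum(archived) / len(archived)) / mean_not
    return ratios


# Hypothetical records, invented for illustration only.
articles = [
    {"journal": "J. Hypothetical A", "year": 2001, "self_archived": True, "citations": 12},
    {"journal": "J. Hypothetical A", "year": 2001, "self_archived": True, "citations": 8},
    {"journal": "J. Hypothetical A", "year": 2001, "self_archived": False, "citations": 5},
    {"journal": "J. Hypothetical A", "year": 2001, "self_archived": False, "citations": 5},
    {"journal": "J. Hypothetical B", "year": 2001, "self_archived": True, "citations": 3},
    {"journal": "J. Hypothetical B", "year": 2001, "self_archived": False, "citations": 2},
    {"journal": "J. Hypothetical B", "year": 2001, "self_archived": False, "citations": 4},
    {"journal": "J. Hypothetical C", "year": 2002, "self_archived": True, "citations": 7},
]

ratios = within_journal_ratios(articles)
for (journal, year), ratio in sorted(ratios.items()):
    print(f"{journal} ({year}): archived/non-archived citation ratio = {ratio:.2f}")
```

A ratio above 1 within a journal and year would point to an advantage
for the self-archived articles there; groups lacking either kind of
article are skipped, since they offer no within-journal contrast.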

> The most the present data indicates is that open access does not
> necessarily handicap a journal, and that BMC publishes at least some
> good journals. Both are important results, but more general conclusions
> about the current state of publishing, or more exact statements about
> any particular journal publisher or type of publisher, are not warranted.

I think it might be rather more positive than that! But I agree that,
because of the field- and quality-equation problem, it will not be
open-access versus toll-access journal comparisons but self-archived
vs. non-self-archived comparisons *within* the same toll-access journals
that will give the most accurate and unequivocal estimates.

> I eagerly anticipate the paired comparisons now in progress--even though
> they will still necessarily be limited to citations.

Until and unless the research community finds a better metric for
measuring research impact, citations (and their correlates: downloads,
co-citations, co-semantics, etc.) will have to do! (This is rather like
saying one looks forward to the weather report, even though it will be
necessarily limited to temperature, pressure, probability of
precipitation, etc.! The *quality* of the weather can only be known by
experiencing it directly, and that is what meteorology is trying to spare
us the trouble of having to do for ourselves, in advance, in every

Stevan Harnad

NOTE: A complete archive of the ongoing discussion of providing open
access to the peer-reviewed research literature online is available at
the American Scientist September Forum (98 & 99 & 00 & 01 & 02 & 03):
