Re: Manual Evaluation of Algorithm Performance on Identifying OA

From: Stevan Harnad <>
Date: Thu, 12 Jan 2006 16:02:51 +0000

On Wed, 11 Jan 2006, David Goodman wrote (in liblicense-l):

> Within the last few months, Stevan Harnad and his group, and we in our
> group, have carried out together several manual measurements of OA (and
> sometimes OAA, Open Access Advantage). The intent has been to independently
> evaluate the accuracy of Chawki Hajjem's robot program, which has been
> widely used by Harnad's group to carry out similar measurements by computer.
> The results from these measurements were first reported in a joint posting
> on Amsci,* referring for specifics to a simultaneously posted detailed
> technical report,** in which the results of each of several manual
> analyses were separately reported.
> *
> =american-scientist-open-access-forum&D=1&O=D&F=l&P=96445)
> ** "Evaluation of Algorithm Performance on Identifying OA" by Kristin
> Antelman, Nisa Bakkalbasi, David Goodman, Chawki Hajjem, Stevan Harnad (in
> alphabetical order) posted on ECS as http: eprints/,
> From these data, both groups agreed that "In conclusion, the robot is not
> yet performing at a desirable level and future work may be needed to
> determine the causes, and improve the algorithm."

I am happy that David and his co-workers did an independent test of
how accurately Chawki's robot detects OA. The robot over-estimates OA
(i.e., it miscodes many non-OA articles as OA: false positives, or
false OA).

Since our primary interest was and is in demonstrating the OA citation impact
advantage, we had reasoned that any tendency to mix up OA and non-OA would
go against us, because we were comparing the relative number of citations
for OA and non-OA articles: the OA/non-OA citation ratio. So mixing up
OA and non-OA would simply dilute that ratio, hence the detectability
of any underlying OA advantage. (But more on this below.)
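
A minimal sketch of this dilution argument, in Python. The citation means
(6.0 and 4.0), the OA share and the error rates are invented for
illustration and are not taken from any of the studies. The sketch also
assumes that miscoding is unrelated to citation counts, which is the
assumption that would fail under the artifact discussed later in this
message.

```python
def observed_ratio(mean_oa, mean_non_oa, p_oa, hit_rate, false_alarm_rate):
    """OA/non-OA citation ratio as seen through an imperfect coder.

    Assumes coding errors are independent of citation counts.
    """
    # Articles tagged "OA": true OA that were hit, plus non-OA false alarms.
    tagged_oa_true = p_oa * hit_rate
    tagged_oa_false = (1 - p_oa) * false_alarm_rate
    mean_tagged_oa = (
        tagged_oa_true * mean_oa + tagged_oa_false * mean_non_oa
    ) / (tagged_oa_true + tagged_oa_false)

    # Articles tagged "non-OA": correct rejections, plus missed OA.
    tagged_non_true = (1 - p_oa) * (1 - false_alarm_rate)
    tagged_non_false = p_oa * (1 - hit_rate)
    mean_tagged_non = (
        tagged_non_true * mean_non_oa + tagged_non_false * mean_oa
    ) / (tagged_non_true + tagged_non_false)

    return mean_tagged_oa / mean_tagged_non


# Hypothetical true advantage of 50% (ratio 1.5).
perfect = observed_ratio(6.0, 4.0, p_oa=0.15, hit_rate=1.0, false_alarm_rate=0.0)
noisy = observed_ratio(6.0, 4.0, p_oa=0.15, hit_rate=0.8, false_alarm_rate=0.3)
print(perfect, noisy)
```

Under these assumptions the noisy coder still shows an advantage, but a
smaller one than the true ratio: random miscoding shrinks the effect
rather than creating it.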

We were not particularly touting the robot's accuracy in and of itself,
nor its absolute estimates of the percentage of OA articles. There are
other estimates of %OA, and they all agree that it is roughly between 5%
and 25%, depending on field and year. We definitely do not think that
pinning down that absolute percentage accurately is the high priority
research goal at this time.

In contrast, confirming the OA impact advantage (as first reported in 2001
by Lawrence for computer science) across other disciplines *is* a high
priority research goal today (because of its importance for motivating
OA). And we have already confirmed that OA advantage in a number of
areas of physics and mathematics *without the use of a robot.*

    Brody, T. and Harnad, S. (2004) Comparing the Impact of Open Access
    (OA) vs. Non-OA Articles in the Same Journals. D-Lib Magazine 10(6).

    Harnad, S., Brody, T., Vallieres, F., Carr, L., Hitchcock, S.,
    Gingras, Y., Oppenheim, C., Stamerjohanns, H. and Hilf, E. (2004)
    The Access/Impact Problem and the Green and Gold Roads to Open
    Access. Serials Review 30(4).

    Brody, T., Harnad, S. and Carr, L. (2005) Earlier Web Usage
    Statistics as Predictors of Later Citation Impact. Journal of the
    American Society for Information Science and Technology (JASIST).

For the OA advantage too, it is its virtually exception-free positive
polarity that is most important today -- less so its absolute value or
variation by year and field.

The summary of the Goodman et al. independent signal-detection analysis
of the robot's accuracy is the following:

    This is a second signal-detection analysis of the accuracy of a
    robot in detecting open access (OA) articles (by checking by hand
    how many of the articles the robot tagged OA were really OA, and
    vice versa). A first analysis, on a smaller sample (Biology: 100
    OA, 100 non-OA), had found a detectability (d') of 2.45 and bias of
    0.52 (hits 93%, false positives 16%; Biology %OA: 14%; OA citation
    advantage: 50%). The present analysis on a larger sample (Biology:
    272 OA, 272 non-OA) found a detectability of 0.98 and bias of 0.78
    (hits 77%, false positives, 41%; Biology %OA: 16%; OA citation
    advantage: 64%). An analysis in Sociology (177 OA, 177 non-OA)
    found near-chance detectability (d' = 0.11) and an OA bias of 0.99
    (hits, 9%, false alarms, -2%; prior robot estimate Sociology %OA:
    23%; present estimate 15%). It was not possible from these data to
    estimate the Sociology OA citation advantage. CONCLUSIONS: The robot
    significantly overcodes for OA. In Biology 2002, 40% of identified
    OA was in fact OA. In Sociology 2000, only 18% of identified OA
    was in fact OA. Missed OA was lower: 12% in Biology 2002 and 14% in
    Sociology 2000. The sources of the error are impossible to determine
    from the present data, since the algorithm did not capture URLs for
    documents identified as OA. In conclusion, the robot is not yet
    performing at a desirable level and future work may be needed to
    determine the causes, and improve the algorithm.
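
For readers who want to check the detectability and bias figures, here is
a sketch in Python. The report does not say which formulas were used, so
this assumes the standard equal-variance Gaussian definitions: d' is
z(hits) minus z(false alarms), and bias is the likelihood ratio beta. The
function names are illustrative.

```python
from math import exp
from statistics import NormalDist

# Inverse of the standard normal CDF (the "z-transform" of a rate).
_z = NormalDist().inv_cdf


def d_prime(hit_rate, false_alarm_rate):
    """Detectability; both rates must lie strictly between 0 and 1."""
    return _z(hit_rate) - _z(false_alarm_rate)


def beta(hit_rate, false_alarm_rate):
    """Likelihood-ratio bias; values below 1 mean a bias toward saying 'OA'."""
    zh, zf = _z(hit_rate), _z(false_alarm_rate)
    return exp((zf ** 2 - zh ** 2) / 2)


# Rounded rates quoted in the summary for the two Biology samples.
first = (d_prime(0.93, 0.16), beta(0.93, 0.16))
second = (d_prime(0.77, 0.41), beta(0.77, 0.41))
print(first, second)
```

With the rounded percentages quoted above, this gives a d' of about 2.47
and a beta of about 0.55 for the first sample, and about 0.97 and 0.78
for the second. These are close to, but not identical with, the reported
values, presumably because the published rates are rounded.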

In other words, the second test, based on the better, larger sample,
finds a lower accuracy and a higher false-OA bias. In Biology, the
robot had estimated 14% OA overall; the estimate based on the Goodman
et al sample was instead 16% OA. (So the robot's *over*coding of the
OA had actually resulted in a slight *under*estimate of %OA -- largely
because the population proportion of OA is so low: somewhere between 5%
and 25%.) The robot had found an average OA advantage of 50% in Biology;
the Goodman et al sample found an OA advantage of 64%. (Again, there was
not much change, because the overall proportion of OA is still so low.)
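
A sketch of the arithmetic, in Python. It assumes (the report does not
spell this out) that the corrected %OA is obtained by weighting the
hand-verified rates by the robot's own split into "OA" and "non-OA". The
inputs are the rounded figures quoted in the summary above.

```python
def corrected_oa_share(robot_share, true_oa_among_tagged, oa_among_untagged):
    """Estimate the true OA share from the robot's share and hand-checked rates.

    robot_share:          fraction of articles the robot tagged OA
    true_oa_among_tagged: fraction of robot-tagged OA that was really OA
    oa_among_untagged:    fraction of robot-tagged non-OA that was really OA
    """
    return (
        robot_share * true_oa_among_tagged
        + (1 - robot_share) * oa_among_untagged
    )


biology = corrected_oa_share(0.14, 0.40, 0.12)
sociology = corrected_oa_share(0.23, 0.18, 0.14)
print(round(biology, 2), round(sociology, 2))
```

Under this assumption the results come out near 16% for Biology and 15%
for Sociology. Because the non-OA pile is so much larger than the OA
pile, a modest miss rate applied to it adds more true OA than the false
positives take away.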

Our robot's accuracy for Sociology (which we had not tested, so Goodman
et al's was the first test) turned out to be much worse, and we are
investigating this further. It will be important to find out why the
robot's accuracy in detecting OA would vary from field to field.

> Our group has now prepared an overall meta-analysis of the manual results
> from both groups. *** We are able to combine the results, as we all were
> careful to examine the same sample base using identical protocols for both
> the counting and the analysis. Upon testing, we found a within-group
> inter-rater agreement of 93% and a between-groups agreement of 92%.
> *** "Meta-analysis of OA and OAA manual determinations." David Goodman,
> Kristin Antelman, and Nisa Bakkalbasi,
> <>

I am not sure about the informativeness of a "meta-analysis" based on
two samples, from two different fields, whose main feature is that there
seems to be a substantial difference in robot accuracy between the two
fields! Until we determine why the robot's accuracy would differ by field,
combining these two divergent results is like averaging over apples and
oranges. It is trying to squeeze too much out of limited data.

Our own group is currently focusing on testing the robot's accuracy in
Biology and Sociology (see end of this message), using a still larger
sample of each, and looking at other correlates, such as the number of
search-matches for each item. This is incomparably more important than
simply increasing the robot's accuracy for its own sake, or for trying to
get more accurate absolute estimates of the percentage of OA articles,
because if the robot's false-OA bias were to be large enough *and* were
correlated with the number of search-match items (i.e., if articles that
have more non-OA matches on the Web are more likely to be falsely coded
as OA) then this would compromise the robot-based OA-advantage estimates.

> Between us, we analyzed a combined sample of 1198 articles in biology and
> sociology, 559 of which the robot had identified as OA, and 559 of which
> the robot had reported as non-OA.
> Of the 559 robot-identified OA articles, only 224 actually were OA (37%).
> Of the 559 robot-identified non-OA articles, 533 were truly non-OA (89%).
> The discriminability index, a commonly used figure of merit, was only 0.97.

It is not at all clear what these figures imply, if anything. What would
be of interest would be to calculate the OA citation advantage for each
field (separately, and then, if you wish, combined) based on the citation
counts for articles now correctly coded by humans as OA and non-OA in
this sample, and to compare that with the robot-based estimate.

Further calculations of the robot's overall inaccuracy, averaged across these
two fields, do not in and of themselves provide any useful information.

> (We wish to emphasize that our group's results find true OAA in biology at
> a substantial level, and we all consider OAA one of the many reasons that
> authors should publish OA.)

It would be useful to look at the OAA (OA citation advantage) for the
Sociology sample too, but note that the right way to compare OA and non-OA
citations is within the same journal/year. Here only one year is involved,
and perhaps even the raw OA/non-OA citation ratio will tell us something,
but not a lot, given that there can be journal-bias, with the OA articles
coming from some journals and the non-OA ones coming from different
journals: Journals do not all have the same average citation counts.
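
A small Python sketch of the difference between a raw pooled ratio and a
within-journal/year ratio. The data are invented: journal "A" is highly
cited, journal "B" is not, and the function name is illustrative.

```python
from collections import defaultdict
from statistics import mean


def within_journal_year_ratios(articles):
    """articles: iterable of (journal, year, is_oa, citations) tuples."""
    groups = defaultdict(lambda: {True: [], False: []})
    for journal, year, is_oa, citations in articles:
        groups[(journal, year)][is_oa].append(citations)

    ratios = {}
    for key, by_status in groups.items():
        oa, non_oa = by_status[True], by_status[False]
        # Skip groups lacking either kind, or with a zero non-OA mean.
        if oa and non_oa and mean(non_oa) > 0:
            ratios[key] = mean(oa) / mean(non_oa)
    return ratios


# Invented data: the OA advantage is exactly 2.0 inside each journal.
sample = [
    ("A", 2002, True, 20), ("A", 2002, False, 10), ("A", 2002, False, 10),
    ("B", 2002, True, 2), ("B", 2002, False, 1), ("B", 2002, False, 1),
    ("B", 2002, False, 1),
]
per_journal = within_journal_year_ratios(sample)

# Raw ratio that ignores which journal each article came from.
pooled = mean(c for _, _, oa, c in sample if oa) / mean(
    c for _, _, oa, c in sample if not oa
)
print(per_journal, pooled)
```

In this toy sample the pooled ratio overstates the within-journal
advantage, simply because the non-OA articles are drawn
disproportionately from the low-citation journal. The bias could equally
run the other way.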

> In the many separate postings and papers from the SH group, such as ****
> and ***** done without our group's involvement, their authors refer only
> to the SH part of the small manual inter-rater reliability test. As it was
> a small and nonrandom sample, it yields an anomalous discriminability
> index of 2.45, unlike the values found for larger individual tests or for
> the combined sample. They then use that partial result by itself to prove
> the robot's accuracy.
> **** such as "Open Access to Research Increases Citation Impact" by
> Chawki Hajjem, Yves Gingras, Tim Brody, Les Carr, and Stevan Harnad
> *****: "Ten-Year Cross-Disciplinary Comparison of the Growth of Open
> Access and How it Increases Research Citation Impact" by C. Hajjem, S.
> Harnad, and Y. Gingras in IEEE Data Engineering Bulletin, 2005,

No one is "proving" (or interested in proving) robot accuracy! In our
publications to date, we cite our results to date. The Goodman et al. test
results came out too late to be mentioned in the ***** published article,
but they will be mentioned in the **** updated preprint (along with the
further results from our ongoing tests).

> None of the SH group's postings or publications refer to the joint report
> from the two groups, of which they could not have been ignorant, as the
> report was concurrently being evaluated and reviewed by SH.

Are Goodman et al. suggesting that there has been some suppression
of information here -- information from reports that we have co-signed
and co-posted publicly? Or are Goodman et al. concerned that they are
not getting sufficient credit for something?

> Considering that both the joint ecs technical report ** and the separate
> SH group report***** were both posted on Dec. 16 2005, we have here
> perhaps the first known instance of an author posting findings on the same
> subject, on the same day, as adjacent postings on the same list, but with
> opposite conclusions.

One of the postings being a published postprint and the other an unpublished
preprint! Again, what exactly is Goodman et al.'s point?

> In view of these joint results, there is good reason to consider all
> current and earlier automated results performed using the CH algorithm to
> be of doubtful validity. The reader may judge: merely examine the graphs
> in the original joint Technical Report; **. They speak for themselves.

No, the robot accuracy tests do not speak for themselves. Nor does the
conclusion of Goodman et al's preprint (***) (which I am now rather beginning
to regret having obligingly "co-signed"!):

    "In conclusion, the robot is not yet performing at a desirable level
    and future work may be needed to determine the causes, and improve
    the algorithm."

What *I* meant in agreeing with that conclusion was that we needed
to find out why there were the big differences in the robot accuracy
estimates (between our two samples and between the two fields). The
robot's detection accuracy can and will be tightened, if and when it
becomes clear that it needs to be, for our primary purpose (measuring and
comparing the OA citation advantage across fields) or even our secondary
purpose (estimating the relative percentage of OA by field and year),
but not as an end in itself (i.e., just for the sake of increasing or
"proving" robot accuracy).

The reason we are doing our analyses with a robot rather than by hand is
to be able to cover far more fields, years and articles, more quickly,
than it is possible to do by hand. The hand-samples are a good check on
the accuracy of the robot's estimates, but they are not necessarily a
level of accuracy we need to reach or even approach with the robot!

On the other hand, potential artifacts -- tending in opposite directions
-- do need to be tested, and, if necessary, controlled for (including
tightening the robot's accuracy):

    (1) to what extent is the OA citation "advantage" just a non-causal
    self-selection quality bias, with authors selectively self-archiving
    their higher-quality, hence higher citation-probability articles?

    (2) to what extent is the OA citation "advantage" just an artifact
    of false positives by the robot? (because there will be more false
    positives when there are more matches with the reference search from
    articles *other* than the article itself, hence more false positives
    with articles that are more cited on the web, which would make the
    robot-based outcome not an OA effect, and circular)

A third question (not about a potential artifact, but about a genuine
causal component of the OA advantage) is:

    (3) to what extent is the OA advantage an Early (preprint) Advantage?

For those who are interested in our ongoing analyses, I append some
further information below.

Stevan Harnad

Chawki: Here are the tests and controls that need to be done
to determine both the robot's accuracy in detecting and estimating
%OA and the causality of the observed citation advantage:

(1) When you re-do the searches in Biology and Sociology (to begin with:
other disciplines can come later), make sure to (1a) store the number as
well as the URLs of all retrieved sites that match the reference-query and
(1b) make the robot check the whole list (up to at least the pre-specified
N-item limit you used before) rather than the robot's stopping as soon as
it thinks it has found that the item is "OA," as in your prior searches.

That way you will have, for each of your Biology and Sociology ISI
reference articles, not only their citation counts, but also their
query-match counts (from the search-engines) and also the number and
ordinal position for every time the robot calls them "OA." (One item
might have, say, k query-matches, with the 3rd, 9th and kth one judged
"OA" by the robot, and the other k-3 judged non-OA.)

Both the number (and URLs) of query-matches and the ordinal position of
the first "OA"-call and the total number and proportion of OA-calls
will be important test data to make sure that our robot-based OA
citation advantage estimate is *not* just a query-match-frequency and/or
query-match frequency plus false alarm artifact. (The potential artifact
is that the robot-based OA advantage is not an OA advantage at all, but
merely a reflection of the fact that more highly cited articles are more
likely to have online items that *cite* them, and that these online
items are the ones the robot is *mistaking* for OA full-texts of the
*cited* article itself.)
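
A sketch, in Python, of the kind of per-article record and check described
above. The class, field and function names are hypothetical, as are the
example URLs and the threshold. The check asks whether hand-verified
non-OA articles are miscoded as OA more often when they have more
query-matches.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SearchRecord:
    """Hypothetical per-article record; all field names are illustrative."""
    article_id: str
    citation_count: int
    # Every query-match returned by the search, in rank order.
    match_urls: List[str] = field(default_factory=list)
    # 1-based ranks of the matches that the robot tagged "OA".
    oa_call_positions: List[int] = field(default_factory=list)
    # Result of the hand check, if one was done.
    human_oa: Optional[bool] = None

    @property
    def match_count(self) -> int:
        return len(self.match_urls)

    @property
    def robot_oa(self) -> bool:
        return bool(self.oa_call_positions)


def false_alarm_rates(records, threshold):
    """False-alarm rate among hand-verified non-OA articles, by match count."""
    tallies = {"low": [0, 0], "high": [0, 0]}  # [false alarms, truly non-OA]
    for rec in records:
        if rec.human_oa is not False:
            continue  # keep only articles a human coded as non-OA
        bucket = "high" if rec.match_count >= threshold else "low"
        tallies[bucket][1] += 1
        tallies[bucket][0] += rec.robot_oa
    return {b: (fa / n if n else None) for b, (fa, n) in tallies.items()}


def _rec(i, n_matches, oa_positions, human_oa):
    # Invented example data; example.org URLs are placeholders.
    urls = [f"http://example.org/{i}/{k}" for k in range(n_matches)]
    return SearchRecord(f"art{i}", 0, urls, oa_positions, human_oa)


records = [
    _rec(1, 2, [], False), _rec(2, 3, [], False),
    _rec(3, 4, [1], False), _rec(4, 5, [], False),
    _rec(5, 20, [3, 9, 20], False), _rec(6, 25, [2], False),
    _rec(7, 30, [], False), _rec(8, 40, [7], False),
    _rec(9, 12, [1], True),  # truly OA, so ignored by the check
]
rates = false_alarm_rates(records, threshold=10)
print(rates)
```

A markedly higher false-alarm rate in the "high" bucket, as in this
invented sample, is the pattern that would signal the artifact. Similar
rates in both buckets would argue against it.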

(2) As a further check on robot accuracy, please use a subset
of URLs for articles that we *know* to be OA (e.g., from PubMed
Central, Google Scholar, Arxiv, CogPrints) and try both the search-engines
(for % query-matches) and the robot (for "%OA") on them. That will give
another estimate of the *miss* rate of the search-engines as well
as of the robot's algorithm for OA.

(3) While you are doing this, in addition to the parameters that
are stored with the reference (the citation count, the URLs for every
query-match by the search, the number, proportion, and ordinal position
of those of the matches that the robot tags as "OA"), please also store
the citation impact factor of the *journal* in which the reference
article was published. (We will use this to do sub-analyses to see
whether the pattern is the same for high and low impact journals, and
across disciplines; we will also look at it separately, for %OA among
articles at different citation levels (1, 2-3, 4-7, 8-15, 16-31, 32-63,
64+), again within and across years and disciplines.)
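
A sketch of the citation-level binning in Python, assuming the levels are
power-of-two ranges (1, 2-3, 4-7, 8-15, 16-31, 32-63, 64+) and that
uncited articles fall outside the scheme. The function name is
illustrative.

```python
def citation_bin(citations):
    """Return the citation-level label for a citation count, or None if < 1."""
    if citations < 1:
        return None
    if citations >= 64:
        return "64+"
    # Largest power of two that does not exceed the count.
    low = 1 << (citations.bit_length() - 1)
    high = 2 * low - 1
    return str(low) if low == high else f"{low}-{high}"


labels = [citation_bin(c) for c in (0, 1, 2, 3, 4, 7, 8, 15, 16, 63, 64, 500)]
print(labels)
```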

(4) The sampling for Biology and Sociology should of course be based
on *pairs* within the same journal/year/issue-number: Assuming that
you will be sampling 500 pairs (i.e., 1000 items) in each discipline
(1000 Biology, 1000 Sociology), please first pick a *random* sample of 50
pairs for each year, and then, within each pair, pick, at *random*, one
OA and one non-OA article per same issue. Use only the robot's *first*
ordinal OA as your criterion for "OA" (so that you are duplicating the
methodology the robot had used); the criterion for non-OA is, as before:
none found among all of the search matches. If you feel you have the
time, it would also be informative to check the 2nd or 3rd "OA" item if
the robot found more than one. That too would be a good control datum,
for evaluating the robot's accuracy under different conditions (number
of matches; number/proportion of them judged "OA").
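
A sketch of this paired sampling in Python. The dictionary keys, the
function name and the example data are hypothetical. It takes one
OA/non-OA pair per sampled issue and treats "OA" as whatever the robot
tagged.

```python
import random
from collections import defaultdict


def sample_pairs(articles, pairs_per_year, seed=None):
    """Draw random OA/non-OA pairs from the same journal, year and issue.

    articles: iterable of dicts with keys
              'id', 'journal', 'year', 'issue', 'robot_oa'.
    Returns a list of (year, (journal, issue), oa_id, non_oa_id) tuples.
    """
    rng = random.Random(seed)
    by_year = defaultdict(lambda: defaultdict(lambda: {True: [], False: []}))
    for a in articles:
        issue_key = (a["journal"], a["issue"])
        by_year[a["year"]][issue_key][bool(a["robot_oa"])].append(a["id"])

    pairs = []
    for year in sorted(by_year):
        # Only issues with at least one OA and one non-OA article qualify.
        eligible = sorted(
            key for key, ids in by_year[year].items() if ids[True] and ids[False]
        )
        chosen = rng.sample(eligible, min(pairs_per_year, len(eligible)))
        for key in chosen:
            ids = by_year[year][key]
            pairs.append((year, key, rng.choice(ids[True]), rng.choice(ids[False])))
    return pairs


# Invented example data.
articles = [
    {"id": "a1", "journal": "J1", "year": 2002, "issue": 1, "robot_oa": True},
    {"id": "a2", "journal": "J1", "year": 2002, "issue": 1, "robot_oa": False},
    {"id": "a3", "journal": "J1", "year": 2002, "issue": 2, "robot_oa": False},
    {"id": "a4", "journal": "J2", "year": 2002, "issue": 1, "robot_oa": True},
    {"id": "a5", "journal": "J2", "year": 2002, "issue": 1, "robot_oa": False},
    {"id": "a6", "journal": "J2", "year": 2002, "issue": 1, "robot_oa": False},
]
pairs = sample_pairs(articles, pairs_per_year=50, seed=0)
print(pairs)
```

Issues that are all-OA or all-non-OA (such as J1 issue 2 here) yield no
pair and are skipped, so a year may return fewer pairs than requested.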

(5) Count also the number of *journals* for which the robot judges that
it is at or near 100% OA (for those are almost certainly OA journals
and not self-archived articles). Include them in your %OA counts,
but of course not in your OA/NOA ratios. (It would be a good
idea to check all the ISI journal names against the DOAJ OA journals
list -- about 2000 journals -- to make sure you catch all the OA
journals.) Keep a count also of how many individual journal *issues*
have either 100% OA or 0% OA (and were hence eliminated from the OA/NOA
citation ratio). Those numbers will also be useful for later analyses and

With these data we will be in a much better position to estimate
the robot's accuracy and some of the factors contributing to the OA
citation advantage.
Received on Thu Jan 12 2006 - 17:01:09 GMT
