Re: Interoperability - subject classification/terminology

From: Stevan Harnad
Date: Fri, 14 Nov 2003 17:07:19 +0000

On Thu, 13 Nov 2003, Franklin, Rosemary (franklra) wrote:

> Generally you are searching in natural language, depending on the fields
> tagged and how the file is organized. Portals such as the HUMBUL site and
> others organized around broad subject areas are value-added OAI searching
> and have controlled vocabulary added, or they are in the process of adding.

I would like to make a bet about which added values will, and which will not,
prove worth adding to a full-text corpus of refereed research journal articles.
(Note that this bet pertains *only* to the refereed journal article corpus, but
that corpus includes all disciplines, including the humanities):

Until and unless XML tagging of the full-texts themselves prevails -- a
desirable outcome that is largely independent of the urgent goal of open
access -- nothing will come even close to matching (let alone beating)
the power of boolean search over the inverted full-texts, google-style
(but restricted to the OAI-compliant domain).
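
For readers unfamiliar with the term, here is a minimal sketch of what
"boolean search over the inverted full-texts" means. It is only an
illustration under stated assumptions: the three toy documents and the
function names (build_inverted_index, search_and, search_or, search_not)
are hypothetical, and nothing here is meant to describe how any actual
OAI service or search engine is implemented.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search_and(index, *terms):
    """Documents containing every term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def search_or(index, *terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

def search_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())

# Hypothetical toy corpus standing in for harvested full-texts.
docs = {
    "a1": "Open access to refereed journal articles",
    "a2": "Subject classification of journal articles in the humanities",
    "a3": "Preservation of open access eprints",
}
index = build_inverted_index(docs)

# "open AND access"
open_access = search_and(index, "open", "access")
# "journal AND articles AND NOT classification"
unclassified = search_and(index, "journal", "articles") & search_not(
    index, docs, "classification"
)
```

Note that no subject taxonomy is consulted anywhere in the sketch: the index
is derived entirely from the words of the texts themselves, which is the
point of the bet.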

Please remember that most researchers currently search their abstracts databases and
their toll-access journal content databases without the help of any subject
classification taxonomies. This will continue to be the case for the open-access
full-text database, once it grows to a significant size. Journal articles --
especially when they include inverted full-text -- are not, and never
were, searched via prepackaged subject classifications or taxonomies
or aggregations. And even those taxonomies and aggregations that exist
were generated by machine analysis of the database rather than by human
classification. (In other words, they were generated by "semantic-web"
-- i.e., syntactic-web! -- computations on the full-text database.)

See Subject Thread:
    "Interoperability - subject classification/terminology"

I know that especially in the humanities, many scholars and librarians are betting
otherwise. It will be interesting to see what the outcome turns out to be.

But let it be stressed again: This has nothing to do with open access, except
inasmuch as it is extremely important not to hold back open access for even one
microsecond in order to wait for classification/taxonomy values to be added -- any
more than open access should be delayed in any way to wait for preservation values
to be added.

The intuitive point to keep in mind is that we are talking about OAI
eprint space, not google space. Needle/haystack problems in google space
vanish when it is contracted to just the OAI eprint subspace. OAI eprint space
consists of the yearly 2,500,000 articles in the planet's 24,000 peer-reviewed
journals in all fields and languages, before (preprints) and after peer
review (postprints).

Stevan Harnad

