Re: Interoperability - subject classification/terminology

From: Claus Schroeter <>
Date: Thu, 27 Mar 2003 13:16:26 +0100 (CET)

Hi OAIers,

I agree with Hussein so far. Building taxonomies is a lot of
effort. Just for inspiration, I'd like to describe what we're doing at this

We created a specialized search engine that goes beyond boolean full-text
search and lets users go deeper into thematic fields. We do this with
trainable classifiers that learn which field a text discusses. We're not
using the OAI classifications, since the underlying taxonomy varies from
archive to archive and we also have material from the web that is not
categorized.

The basic idea behind this system is to enable users to explore
fields of knowledge just by selecting the fields of interest. As a
positive side effect, the taxonomy of this system is freely configurable,
so you may configure one taxonomy for physics, one for chemistry, or
whatever. All taxonomies use the same texts in the background.
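
A minimal sketch of this idea, not the actual system: the classifier below
is a toy term-frequency model, and every name in it (CentroidClassifier,
bind, the sample texts and training words) is an assumption made up for
illustration. It only shows how several taxonomies can be trained
separately while sharing one pool of texts.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class CentroidClassifier:
    """Toy trainable classifier: one term-frequency profile per category."""

    def __init__(self):
        self.profiles = {}

    def train(self, category, text):
        # Accumulate word counts for the category from an example text.
        self.profiles.setdefault(category, Counter()).update(tokenize(text))

    def score(self, text):
        words = Counter(tokenize(text))
        return {cat: cosine(words, prof) for cat, prof in self.profiles.items()}

    def classify(self, text):
        scores = self.score(text)
        return max(scores, key=scores.get)


# One shared pool of texts; no category is stored with a text.
texts = {
    "doc1": "the star collapsed into a black hole emitting radiation",
    "doc2": "the acid reacts with the metal forming a salt",
}

# Each taxonomy is just a separately trained classifier (hypothetical data).
physics = CentroidClassifier()
physics.train("Physics/Astrophysics", "star galaxy black hole radiation telescope")
physics.train("Physics/Geophysics", "earthquake seismic mantle crust plate")

chemistry = CentroidClassifier()
chemistry.train("Chemistry/Inorganic", "acid metal salt oxide reacts")
chemistry.train("Chemistry/Organic", "carbon polymer benzene alcohol ester")

taxonomies = {"physics": physics, "chemistry": chemistry}


def bind(taxonomy_name, doc_id):
    """Compute the category of a text under a chosen taxonomy on demand."""
    return taxonomies[taxonomy_name].classify(texts[doc_id])
```

Because bind() computes the category at query time, adding or swapping a
taxonomy only means training another classifier; the stored texts stay
untouched.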

In technical terms, the taxonomy binding of a text is not fixed.
Perhaps it would be a good idea to implement this loose binding for OAI, so
taxonomy bindings can be used as an exchangeable schema for archive

If you're interested please feel free to take a look at:

Try, for example, the category ->Physics/Astrophysics and use the "further
restrictions" selector to restrict to ->Physics/History of Physics or
->Physics/Geophysics. You will see that the ranking switches to
exactly the subtopic of interest.
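
One way such a restriction could re-rank results, sketched under stated
assumptions: the per-document scores below are invented numbers, and
combining them by multiplication is only a guess at a plausible scheme,
not a description of how the real system ranks.

```python
from math import prod

# Hypothetical classifier scores in [0, 1] for each document and category.
scores = {
    "paper_a": {"Physics/Astrophysics": 0.9, "Physics/History of Physics": 0.1},
    "paper_b": {"Physics/Astrophysics": 0.6, "Physics/History of Physics": 0.8},
    "paper_c": {"Physics/Astrophysics": 0.2, "Physics/History of Physics": 0.9},
}


def rank(selected):
    """Order documents by the product of their scores in all selected categories."""

    def combined(doc):
        # A category missing for a document counts as score 0.
        return prod(scores[doc].get(cat, 0.0) for cat in selected)

    return sorted(scores, key=combined, reverse=True)
```

With only Physics/Astrophysics selected, paper_a ranks first; adding
Physics/History of Physics as a further restriction moves paper_b to the
top, since it scores well in both categories.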

On Thu, 27 Mar 2003, Hussein Suleman wrote:

> well, sure, i agree in principle ... if arXiv and similar projects agree
> to bunch all of physics into a single category and use google for
> searching, with no browsing capabilities, it wouldn't be a problem at all.
> similarly, if we grouped together computer science, electrical
> engineering and information systems, that would be ok for gross-level
> interoperability ... once again, assuming searching is the only service
> required. frankly, i think this is a little simplistic and assumes
> digital libraries are no more than submission+search systems.
> [aside: why does eprints support browsing by categories ?]
> besides, who decides what constitutes a discipline anyway ? has anyone
> ever been able to decide if computer science is engineering or science ?
> i think we have more questions than answers here and it isn't as simple
> as you point out or we wouldn't even be discussing this :)
> Stevan Harnad wrote:
> > On Thu, 27 Mar 2003, Hussein Suleman wrote:
> >
> >>...why not use sets for the separate
> >>disciplines, aimed at particular service providers?...
> >>some disciplines are not well-defined (namely, computer science)
> >>so such archives may want to play ball with multiple service providers
> >>and hence may need different sets.
> >
> > The question of taxonomic classification sets and version-control for
> > Open Archives is a technical one, so I will not presume to comment on it
> > except from the point of view of the potential *users* of one particular
> > kind of Archive Content, namely, unrefereed preprints and refereed
> > postprints of research papers from one or many or all disciplines: This
> > -- in the google-age of boolean inverted full-text searchability --
> > does not require a detailed a-priori taxonomy, as book metadata or the
> > metadata for other kinds of material might. A fairly general sorting by
> > discipline should suffice.
> >
> >
> >
> >>...the service provider can provide an
> >>interface for potential data providers to self-register.
> >
> > I hope that once the number and contents of Open-Access Eprint Archives
> > for research preprints and postprints have scaled up toward something
> > closer to universality, the simple metadata descriptors "pre-refereeing
> > preprint" and "refereed journal article" plus perhaps "discipline name"
> > will be enough to guide relevant service-providers in automatically
> > harvesting their relevant metadata. Multiple self-registration seems a
> > tedious and unnecessary constraint. (Possibly a master-registry of valid
> > institutions and disciplinary archives will also help, but may not be
> > necessary unless commercial spamming invades this sector too.)
> >
> >>what remains a difficult problem, however, is how to recreate the
> >>metadata used by the service provider as its native format. so, for a
> >>typical example, if arXiv classifies items using a specific set
> >>structure, this is certainly not going to be the default for an
> >>institutional archive. does the service provider automatically or
> >>manually reclassify? or does it not allow browsing by categories?
> >
> > Worrying about "recreating the categories" in this boolean full-text age
> > is, I believe, a waste of time (for research preprints/postprints). Just
> > harness google's harvested full-text to your engine's search capability,
> > if it is incapable of contending with boolean full-text search on its
> > own. (Manual reclassification! Heaven forfend! Don't bother classifying
> > this material in the first place, beyond the simplest of first-cuts,
> > such as discipline. Any further classification should be algorithmic and
> > text-data-driven, not manual.)
> >
> >>in either event, the quality of the metadata from the perspective of the
> >>service provider may be an impetus for potential users to want to
> >>replicate their effort rather than rely on the automated submission from
> >>their own institutions ... this needs more thought ...
> >
> > Again, I speak only for research preprints/postprints, but please let's
> > not inject any further credibility into the notion that self-archiving
> > author/institutions will also have to self-advertise by multiple
> > self-archiving of the same paper. Surely that is one headache that
> > OAI-interoperability should eradicate from the planet! Self-archiving
> > itself is self-advertising (and effort) enough. Please let us not
> > now -- when the momentum is still not big enough -- saddle would-be
> > self-archivers with needless extra worries, and tasks!
> >
