Tim Brody, Simon Kampa, Steve Hitchcock, Les Carr, Stevan Harnad
We describe “digitometric” services and tools that add value to open-access e-print archives using the Open Archives Initiative (OAI) Protocol for Metadata Harvesting. Celestial is an OAI cache and gateway tool. Citebase Search enhances OAI-harvested metadata with linked references harvested from the full-text to provide a web service for citation navigation and research impact analysis. Digitometrics builds on data harvested using OAI to provide advanced visualisation and hypertext navigation for the research community. Together these services provide a modular, distributed architecture for building a “semantic web” for the research literature.
In this paper we describe services that apply and extend the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) as a means of building user services for the scientific and scholarly research literature.
The services described in this paper touch on a number of digital library topics: infrastructure, accessing legacy data, harvesting, electronic publication, scientometrics, and linking. This covers both using existing data through discovery and conversion, and building new data through processing and analysis.
As authors increasingly use e-print
archives (built using free tools such as
The first part of this paper provides background about the OAI-PMH. We introduce Celestial (a cache/gateway for the OAI-PMH) and Citebase (an end-user service that applies citation analysis to existing OAI-PMH compliant e-print archives). We then analyse Citebase’s database, and summarise the findings of a user survey conducted by the Open Citation Project on Citebase. Finally, we introduce some of the new directions arising out of this work: creating a knowledge environment built on the OAI-PMH.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [Herbert Van de Sompel, Carl Lagoze (2002) “Notes from the Interoperability Front: A Progress Report on the Open Archives Initiative” ECDL 2002, in LNCS 2458, pp. 144-157] is designed to address the need to expose metadata - titles, authors, abstracts etc. - from research literature archives in a structured form. An XML protocol built on the HTTP standard, OAI-PMH is in effect a CGI interface to databases. Based on six commands (or “verbs” in OAI parlance), OAI-PMH allows metadata to be incrementally harvested by service providers (the HTTP client) from data providers (the HTTP server).
There are 62 OAI-registered publicly accessible data providers, exposing around three million records covering research literature (e.g. arXiv.org), music manuscripts (Library of Congress), theses, and others. Some service providers have been developed or adapted to make use of OAI-PMH, e.g. Scirus, which allows users to search both commercial abstract databases and the freely available abstracts from public data providers. In the USA OAI-PMH is being used to build a large-scale distributed library system, NSDL [Carl Lagoze, William Arms, Stoney Gan, Diane Hillmann, Christopher Ingram, Dean Krafft, Richard Marisa, Jon Phipps, John Saylor, Carol Terrizzi, Walter Hoehn, David Millman, James Allan, Sergio Guzman-Lara, Tom Kalt (2002) “Core Services in the Architecture of the National Digital Library for Science Education (NSDL)” Proceedings of the second ACM/IEEE-CS joint conference on Digital libraries 201-209 http://arxiv.org/abs/cs.DL/0201025].
The OAI-PMH allows the transfer of metadata records encoded in XML. To be OAI-compliant a data provider must expose its records in Dublin Core, but it can additionally expose them in any other format that can be encoded in XML.
The metadata records that describe a single entity form an item, identified by a unique identifier.
The OAI-PMH is being used to transfer sizeable amounts of data - in the case of http://arXiv.org/ some 230,000 metadata records. As the number of OAI-PMH sites increases (an M:N relationship between providers and services), and the amount of data within data providers grows, there is a growing need to build scalable infrastructures to support the transfer of data from data providers to service providers. Caching, as provided by Celestial, is a useful method of distributing the load within such distributed systems.
Celestial is software that caches metadata from OAI archives, acts as a gateway between legacy (1.0 and 1.1) and current (2.0) OAI implementations, and attempts to correct the output of incorrectly implemented OAI archives.
In a distributed environment caching moves processing and network load away from the source and closer to the target (Figures 1.1 and 1.2). As OAI archives are often small and low-performance, reducing the load on them can be important – especially where the OAI-PMH interface may be seen to interfere with other services. To support the caching of OAI responses Celestial acts as an OAI cache/proxy. Working at the application-level it harvests records from data providers using the OAI-PMH, and re-exposes them to service-providers through its own OAI-PMH interface. Celestial is able to make a complete copy of an OAI repository, including all the metadata records, and set memberships associated with an item. Should the data provider become unavailable, Celestial is able to act as a surrogate.
By using the incremental, datestamp-based harvesting ability of OAI-PMH, Celestial only harvests those records that are new or have changed from a data provider. By comparison an HTTP cache would have to query all records to determine whether they had altered from a prior harvest.
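As a rough illustration of datestamp-based incremental harvesting, the sketch below builds an OAI-PMH `ListRecords` request that asks only for records changed since the previous harvest. The `verb`, `metadataPrefix` and `from` arguments are part of the OAI-PMH; the function name and the base URL are hypothetical, and this is not taken from Celestial's own code.

```python
from urllib.parse import urlencode


def list_records_url(base_url, last_harvest=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH data provider.

    base_url and last_harvest are illustrative inputs: last_harvest is the
    datestamp of the previous successful harvest, or None for a first,
    complete harvest.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_harvest is not None:
        # Only records created or modified on or after this datestamp are returned.
        params["from"] = last_harvest
    return base_url + "?" + urlencode(params)


# Hypothetical data provider, harvested incrementally since 1 February 2003.
print(list_records_url("http://archive.example.org/oai", "2003-02-01"))
```

A harvester would store the datestamp of each completed harvest and pass it as `last_harvest` next time, so unchanged records are never re-requested.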
Celestial is designed to provide as high performance as possible. It achieves this by trading storage space for performance. A significant overhead with any XML-based application is generating the XML tag structures. To avoid this Celestial stores the OAI header and metadata as XML. When generating a response Celestial prints the raw data, and only needs to generate XML tags for the OAI protocol components (e.g. the request header, and flow-control tokens).
OAI-PMH flow-control is handled using stateless cursors. Celestial assigns each record a datestamp and unique identifier. These two values are joined to form an index into the record list. As a harvester retrieves records Celestial moves a cursor along this index, and at the end of a partial list Celestial provides the harvester with the current cursor (the datestamp plus unique identifier), and an encoding of the original request (which might include a set or datestamp filter) in the OAI-PMH resumption token. Given a resumption token Celestial can jump straight to the end of the previous partial list by using the index key.
If new records are added to Celestial during a harvest they will be returned at the end of the harvest, as a new record’s datestamp will be greater than that of any previous record. This makes the resumption tokens generated by Celestial stateless, as no changes can occur that would make the result set inconsistent.
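The cursor scheme can be sketched as follows, assuming the record index is held as a sorted list of (datestamp, identifier) pairs. The token layout, the page size and the function name are illustrative assumptions rather than Celestial's actual format, and the encoding of the original request (set or datestamp filter) is omitted for brevity.

```python
import bisect

PAGE_SIZE = 2  # illustrative; a real server would return far larger partial lists


def list_page(index, token=None, page_size=PAGE_SIZE):
    """Return one partial list and the resumption token for the next one.

    index is a list of (datestamp, identifier) tuples kept in sorted order.
    token is "datestamp|identifier" of the last record already delivered.
    """
    start = 0
    if token is not None:
        datestamp, identifier = token.split("|", 1)
        # Jump straight to the first record after the previous partial list.
        start = bisect.bisect_right(index, (datestamp, identifier))
    page = index[start:start + page_size]
    more = start + page_size < len(index)
    next_token = "|".join(page[-1]) if page and more else None
    return page, next_token


index = [("2003-01-01", "oai:a:1"), ("2003-01-02", "oai:a:2"), ("2003-01-03", "oai:a:3")]
first_page, token = list_page(index)
# A record arriving mid-harvest sorts after everything already delivered.
index.append(("2003-01-04", "oai:a:4"))
second_page, token = list_page(index, token)
print(first_page, second_page, token)
```

Because the token carries the position itself, the server keeps no per-harvester state between requests.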
OAI archives that do not support 2.0 have been removed from the official OAI-compliant list (and hence are unlikely to be included in new OAI services). As Celestial provides an OAI 2.0 (the current version) interface to harvesters, but can itself harvest from version 1.0, 1.1 or 2.0, it acts as an OAI gateway between deprecated data providers and current service providers. In OAI 2.0 each record carries its own set membership. To provide the set hierarchy to OAI 2.0 harvesters Celestial inverts the set membership exported by an OAI 1.x repository. For OAI 1.x this set membership is found by exhaustively querying each set, building up the set membership for each item.
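The inversion step can be sketched as below, assuming the result of querying every set of an OAI 1.x repository has already been collected into a mapping from set name to item identifiers. The function name and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def invert_set_membership(items_by_set):
    """Turn {set: [item identifiers]} into {item identifier: [sets]}.

    items_by_set is what exhaustive per-set querying of an OAI 1.x
    repository yields; the result is the per-record membership that an
    OAI 2.0 interface reports with each record.
    """
    sets_by_item = defaultdict(list)
    for set_spec in sorted(items_by_set):
        for identifier in items_by_set[set_spec]:
            sets_by_item[identifier].append(set_spec)
    return dict(sets_by_item)


harvested = {"physics": ["oai:x:1", "oai:x:2"], "math": ["oai:x:2"]}
print(invert_set_membership(harvested))
```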
Often data providers will export records from sources that are not Unicode-based. If a data provider does not convert and check these records before exporting them, bad characters can appear in the data provider’s OAI-PMH export, preventing XML parsing. Celestial makes a best effort to correct these errors by replacing each bad character, at the location reported by the XML parser, with a valid character, “?”. The process of XML parsing, correcting characters, and re-parsing can be repeated until either the OAI-PMH response can be parsed or the act of replacing encroaches on the XML tags and makes the response unrecoverable.
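A minimal sketch of this parse, replace and re-parse loop is given below, using Python's `xml.parsers.expat` module and its `ErrorByteIndex` attribute to locate the offending byte. The function name, the retry limit and the rule for deciding that a replacement would damage markup are assumptions made for illustration; Celestial's own implementation may differ.

```python
import xml.parsers.expat


def repair_xml(data, max_fixes=100):
    """Replace unparseable bytes with "?" until the document parses.

    data is the raw response as bytes. Raises the parser's error if the
    fault lies in the markup itself, or ValueError after max_fixes attempts.
    """
    buf = bytearray(data)
    for _ in range(max_fixes + 1):
        parser = xml.parsers.expat.ParserCreate()
        try:
            parser.Parse(bytes(buf), True)
            return bytes(buf)
        except xml.parsers.expat.ExpatError:
            pos = parser.ErrorByteIndex
            # Give up when the error is not at a replaceable character:
            # past the end of input, on markup, or on an earlier "?".
            if pos < 0 or pos >= len(buf) or buf[pos] in b"<>&?":
                raise
            buf[pos] = ord("?")
    raise ValueError("too many bad characters")


# A Latin-1 byte (0xE9) in a response that is parsed as UTF-8.
print(repair_xml(b"<r><t>caf\xe9s</t></r>"))
```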
As well as attempting to fix OAI-PMH responses in real-time, Celestial records errors that occur during harvesting. An archive administrator can use these harvest logs to correct mistakes in their implementation, or underlying data records. As the OAI-compliance tests do not make a full harvest of repositories, this can often highlight problems (e.g. with flow-control) that the OAI registration process does not.
Celestial implements the OAI provenance schema. This records the path that records have taken through OAI proxies, caches and aggregators, by storing with the metadata record the location from which the record was harvested, when it was harvested, and whether any alterations have been made. Provenance data can be used by service providers to “de-dup” the same record, if the service harvests from multiple sources.
A promising possibility for Celestial is as a tool for exposing any data source via an OAI-PMH interface. Out of the box, Celestial only supports getting data via OAI. It is relatively easy, however, to create a system that would insert records directly into Celestial’s back-end database, which Celestial can then serve through its OAI-PMH interface.
While Celestial is a distinct, freely-downloadable
software package, at Southampton University [University
Citebase, more fully described by Hitchcock et al. (2002), allows users to find research papers stored in open access, OAI-compliant archives - currently arXiv (http://arxiv.org/), CogPrints (http://cogprints.soton.ac.uk/) and BioMed Central (http://www.biomedcentral.com/). Citebase harvests OAI metadata records for papers in these archives, as well as extracting the references from each paper. The association between document records and references is the basis for a classical citation database. Citebase is best viewed as a kind of “Google for the refereed literature”, because it ranks search results based on the number of references to papers (or authors) (although it is not – currently – using a hub-authority graph algorithm to rank). Citebase contains 230,000 full-text e-print records, and 6 million references (of which 1 million are linked to the full-text).
Primarily a user-service, Citebase provides a Web site that allows users to perform a meta-search (title, author etc.), navigate the literature using linked citations and citation analysis, and to retrieve linked full-texts in Adobe PDF format. Citebase also provides a machine interface to the citation data it collects through its own OAI-PMH interface using the Academic Metadata Format (AMF), a new XML format for scholarly literature. As part of the development of Citebase we have looked at the relationship between citation impact (“how many times has this article been cited”) and usage impact (“how many times has this article been accessed”).
Citation-navigation provides Web-links over the existing author-generated references. As well as following references to the cited full-text, Citebase provides the user with links to articles that have cited, and to articles that have been co-cited alongside (hence are related to) the current article. This allows the user to navigate back in time (articles referred-to), forward in time (cited-by), and sideways (co-cited alongside).
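The three directions of navigation can be sketched over a list of (citing, cited) pairs, as below. The data layout and the function name are assumptions chosen for illustration and do not describe Citebase's database.

```python
def navigate(citations, article):
    """Return (references, cited_by, co_cited) for one article.

    citations is an iterable of (citing, cited) identifier pairs.
    """
    # Back in time: articles the current article refers to.
    references = sorted({cited for citing, cited in citations if citing == article})
    # Forward in time: articles that cite the current article.
    cited_by = sorted({citing for citing, cited in citations if cited == article})
    # Sideways: other articles cited by those same citing articles.
    co_cited = sorted({cited for citing, cited in citations
                       if citing in cited_by and cited != article})
    return references, cited_by, co_cited


links = [("C", "A"), ("C", "B"), ("D", "A"), ("A", "Z")]
print(navigate(links, "A"))
```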
Citebase provides information about both the citation impact and the usage impact of research articles (and authors), generated from the open-access, pre-print and post-print literature that Citebase covers. The citation impact of an article is the number of citations to that article. The usage impact is an estimate of the number of downloads of that article (measured from one arXiv.org mirror).
The front-end of Citebase is a meta-search engine. This allows the user to search for articles by author, keywords in the title or abstract, publication (e.g. journal), and date of publication. After generating a search, Citebase allows the results to be ranked by 6 criteria: citations (to the paper or authors), Web hits (to the paper or authors), date of creation, and last update. The by-author ranking is calculated as the mean number of citations and hits to an author (e.g. total citations divided by total papers to author “Hawking, S”). A per-paper author-impact is then calculated by taking the mean author-impact of the named authors.
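The arithmetic of the by-author ranking can be illustrated with the sketch below, which uses citation counts only (hits would be handled the same way). The record layout, function names and sample figures are invented for illustration.

```python
from collections import defaultdict


def author_impact(papers):
    """Mean citations per paper for each author name."""
    totals = defaultdict(lambda: [0, 0])  # author -> [citations, papers]
    for paper in papers:
        for author in paper["authors"]:
            totals[author][0] += paper["citations"]
            totals[author][1] += 1
    return {author: cites / count for author, (cites, count) in totals.items()}


def paper_author_impact(paper, impact):
    """Mean author-impact over the named authors of one paper."""
    authors = paper["authors"]
    return sum(impact[a] for a in authors) / len(authors)


papers = [
    {"authors": ["Hawking, S"], "citations": 10},
    {"authors": ["Hawking, S", "Penrose, R"], "citations": 4},
]
impact = author_impact(papers)
print(impact, paper_author_impact(papers[1], impact))
```

Here “Hawking, S” has 14 citations over 2 papers (mean 7.0) and “Penrose, R” has 4 over 1 paper (mean 4.0), so the second paper's author-impact is 5.5.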
From the meta-search users can either choose to view an abstract page, or jump directly to a cached full-text PDF (if available) for each matching record.
The abstract page displays a full metadata record (title, authors, abstract, rights etc.), the harvested reference list (if available), i.e. all articles cited by the target article, together with all articles that have cited the target article and the articles co-cited alongside it. In addition to listing the citing articles, Citebase provides a summary graph that shows, over time, when the citing articles appeared and when the current article was downloaded. This provides a visual link between the citation and web impacts.
When viewing a cached PDF Citebase overlays reference links within the document, so a user can jump from viewing a full-text to the abstract page of a cited article.
Like the archives it harvests from, Citebase provides an OAI-PMH interface to the data that it contains. Along with re-exposing the Dublin Core metadata (title, author, abstract), Citebase provides records in the Academic Metadata Format (AMF). AMF encapsulates the relationships within scholarly research: between authors, articles, organisations, and publications. Other services can harvest this enhanced metadata from Citebase to provide a reference-linked environment or to perform further analysis (or the source archives can harvest it to enhance their own data).
Since the creation of
Web hit data can be subject to inaccuracies and noise. For example, if an arXiv paper is referenced by Slashdot it will receive many hits from casual users (the “Slashdot effect”), which probably do not reflect true impact on other researchers. Citebase filters Web logs by removing known Web crawlers (e.g. Googlebot), and then counting only one hit from one location per day. This is probably an over-correction, for although it removes most “unwanted” hits, it also excludes valid hits from users who may be sharing a single machine, or Web proxy.
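The filtering rule can be sketched as follows. The log-entry layout, the crawler list and the choice to apply the one-hit-per-location-per-day rule separately to each article are assumptions made for this illustration.

```python
KNOWN_CRAWLERS = ("Googlebot",)  # illustrative; a real list would be longer


def count_hits(log_entries):
    """Count filtered hits per article.

    log_entries is an iterable of (host, day, article, user_agent) tuples.
    Crawler requests are dropped, and repeat requests for the same article
    from the same host on the same day are counted once.
    """
    seen = set()
    hits = {}
    for host, day, article, user_agent in log_entries:
        if any(crawler in user_agent for crawler in KNOWN_CRAWLERS):
            continue
        key = (host, day, article)
        if key in seen:
            continue
        seen.add(key)
        hits[article] = hits.get(article, 0) + 1
    return hits


log = [
    ("10.0.0.1", "2003-02-01", "paper-1", "Mozilla/4.0"),
    ("10.0.0.1", "2003-02-01", "paper-1", "Mozilla/4.0"),  # same host, same day
    ("10.0.0.2", "2003-02-01", "paper-1", "Googlebot/2.1"),  # crawler
    ("10.0.0.1", "2003-02-02", "paper-1", "Mozilla/4.0"),
]
print(count_hits(log))
```

The second entry shows the over-correction mentioned above: two different readers behind one proxy would be counted as a single hit.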
Given the success of the arXiv.org online archive, it
is not surprising that citation-impact and usage-impact (measured from the
**Figure 2: r = correlation between citation and web impact, n = size of result set, hep = High Energy Physics subset**
The correlation between citation impact and web impact is highest at the high end of the citation impact spectrum. Low impact papers disappear into obscurity, little read and little cited, while high impact papers continue to be read and cited for longer periods of time. But regardless of whether a paper is destined for obscurity or fame, it still gets a burst of Web impact soon after it is first released on the web (probably coming from users who are following that topic either through the arXiv's automatic daily/weekly new-paper alerting service or through regular active browsing of new contents).
Figure 3: The relation between citation and web impact over time (citations-to and downloads-of an article after it is deposited), by citation impact quartile
Citebase was developed as part of the Open Citation Project, which officially ended in 2002. As part of the project's final report a user survey of Citebase was conducted, both to evaluate the project and to seek feedback about the direction users would like Citebase to take as an ongoing OAI service.
The survey asked users to evaluate Citebase as a research tool. It was found that “Citebase can be used simply and reliably for resource discovery… [T]asks can be accomplished efficiently with Citebase regardless of the background of the user.” Some areas were found that could be improved:
1. More data needs to be collected and the process refined before it is reliable for measuring impact. As part of this process users should be encouraged to use Citebase to compare the evaluative rankings it yields with other forms of ranking (e.g. SLAC/SPIRES is a similar service for High Energy Physics).
2. Although the majority of users were able to complete a task involving all the major features of Citebase, user satisfaction appeared to be markedly lower when users were invited to assess navigability than for other features of Citebase.
3. Citebase needs to be strengthened considerably in terms of the help and support documentation it offers to users.
Approximately three quarters (75,000 of 100,000 in February 2003) of the traffic to Citebase is from users going directly to the abstract page of an article. This is traffic generated from arXiv.org, which links directly from the equivalent abstract page at arXiv.org to the abstract page at Citebase. A challenge for Citebase is to increase the number of users who come directly to Citebase, in addition to those referred from arXiv.org. Addressing the concerns raised by the user survey will greatly improve Citebase’s utility as a “first port of call” for researchers.
The activity of scholarly research involves a process of systematic investigation and collection of information relevant to a particular research field. It is the sum of disparate activities -- empirical, theoretical, and scholarly, including detective work that requires becoming proficient in a field and understanding its present and past literature's landscape. This is a continuous task that requires a scholar's constant attention.
The Digitometric framework provides advanced services over research metadata collected from various repositories through the OAI protocol, to enable researchers to navigate, evaluate and keep up with their growing literatures. It provides an open and extensible interface for adding further services over that metadata. Services implemented so far range from simple visualisations (e.g. bar charts of papers/authors) to advanced visualisations (e.g. co-citation maps), and include knowledge services (e.g. identifying the most prominent researchers) and hypertext linking of different research artifacts.
The Digitometric software runs as a collection of programs and a back-end database to store the collected metadata. The service contacts administrator-specified OAI-compliant servers (e.g. E-Prints.org) to retrieve the metadata. It also exposes its own metadata for other services to use, thus representing an open solution to managing and using scholarly metadata.
Users access the services through a Web interface. An administrator uses this to specify which archives to collect metadata from, and end-users then use the services that the framework offers over that metadata. Initially, the Digitometric database was populated with the entire collection of papers in the arXiv archive. This provided a large base on which to explore the services and to create large, detailed visualisations with which scholars can explore their research landscape.
Digitometrics can be used to dynamically visualise different aspects of research metadata. For example, users can retrieve basic graphs illustrating the highest publishing authors or the citation network of a particular publication (Figure 1).
More interestingly, co-citation maps can be dynamically created and displayed. Two papers are co-cited when a third paper cites them both. Co-citation analysis relates bibliographic data based on co-citation strengths (i.e. the number of times two papers are cited together). These values are used as proximity measures in visualisations where papers that are frequently co-cited are plotted near each other. The resulting graph enables "research fronts" to be identified, the theory being that a research front will usually emerge around a few seminal (or core) papers that are heavily co-cited [Small, H., Co-Citation in the Scientific Literature: A New Measure of the Relationships Between Two Documents, Journal of the American Society for Information Science, 24, pp265-269, 1973.].
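Co-citation strength as defined here can be computed directly from reference lists, as in the sketch below. The input layout (citing paper mapped to the papers it cites) and the function name are assumptions for illustration, not the Digitometric implementation.

```python
from collections import Counter
from itertools import combinations


def cocitation_strengths(reference_lists):
    """Count how often each pair of papers is cited together.

    reference_lists maps a citing paper to the list of papers it cites.
    Each citing paper contributes 1 to every pair in its reference list.
    """
    strengths = Counter()
    for references in reference_lists.values():
        for pair in combinations(sorted(set(references)), 2):
            strengths[pair] += 1
    return strengths


refs = {"P1": ["A", "B", "C"], "P2": ["A", "B"], "P3": ["B"]}
print(cocitation_strengths(refs))
```

A map would then place A and B closest together, since they are co-cited twice, and a co-citation threshold of 2 would drop the pairs involving C.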
Co-citation analysis has been criticised [Edge, D., Why I am not a co-citationist?, Society for Social Studies of Science Newsletter, 2, pp13-19] for over-simplifying the citation link, for technical problems (e.g. inaccurate citations), and for focusing only on citations when other factors (e.g. social/political motivation behind the citation) should be taken into account. Garfield nevertheless notes that co-citation provides a useful and predictive perspective on scholarly material when used cautiously and wisely. He shows convincingly how he has uncovered important historical links between research fronts using co-citation analysis that scholars had previously overlooked [Garfield,
Users of Digitometric services can retrieve a co-citation map at any time and even use a particular publication as the launch-point for navigating co-citation space. Parameters control the accuracy, size and co-citation threshold of a map, and each has a significant bearing on the time required to construct it.
Figure 2 illustrates a simple co-citation map embedded within the Digitometric user interface. The nodes on the map represent individual publications. By hovering with the mouse pointer over a node, the user can display its details (title, author, abstract) in the information box. The arcs between the nodes represent a co-citation relationship. A cluster of related publications is evident in the centre of the map. Four distinct paths emanate from this cluster, indicating the possibility of speciality fields arising out of it.
Figure 3 shows a full-sized co-citation map with a lower co-citation threshold, resulting in more nodes being included. Several clusters (research fronts) are immediately evident. Researchers may get a better understanding of their research landscape by exploring these clusters and the relationships between them.
Accurate and useful co-citation maps require large amounts of high integrity data. While large amounts of data were available for these visualisations, they were sometimes inaccurate or incomplete, which resulted in slight distortions to the maps.
Digitometrics can also analyse the metadata and infer new facts. Such capabilities are central to the Semantic Web initiative. For example, based on the citation patterns between papers, the most significant papers and prominent researchers can be detected. The contributions to a particular line of research or researcher can be mapped by analysing and displaying co-authorship patterns (Figure 4). With further high quality metadata available, researchers can raise increasingly subtle questions about the direction and time-course of developments in their research field, such as how perspectives have changed and how a particular methodology has affected a research area.
Digitometrics allows for complete hypertext linking between different research artifacts (e.g. researcher, publication, project, organisation, publication medium, text). When a service is offered for a particular artifact (e.g. collaboration among researchers, co-citation maps for literature) these are automatically available to the user (whether author, user, or evaluator). Figure 5 illustrates the metadata gathered for a particular researcher, including a list of all their known publications, plus the collaboration measure.
Complete linking is only possible where all metadata are available. The data used for the current application of the Digitometrics framework include only literature and researcher information, and thus the hypertext could not be applied extensively.
We have described services that provide a richer interface to users, built on the free-access literature and the OAI-PMH. Much of the challenge of building services for Open Archives lies in processing unstructured data, typically generated by authors. As these services develop, it will be in authors' interest to provide better data (especially where services are used by research-evaluators and funders to make value-judgements based on research impact), so that the automated systems can successfully parse and present their work to users. Increasing usage of these data and services will thus itself feed back into improving their future quantity and quality.
As archiving software develops it will provide authors with tools to help validate and process their work at the point of submission. As citation linking becomes more important it will be increasingly integrated into the depositing process, allowing an author to see (and correct) the automatic parsing, and to check the links that have been found.
The feedback loop between authors and the hypertext built from their contributions will provide ever improving data, and hence a richer environment for users to navigate. This semantic web, where information is semantically marked-up, will help bridge the gap between information and knowledge.
1. Steve Hitchcock et al. (2002) “Open Citation Linking: The Way Forward”. D-Lib Magazine, Vol. 8, No. 10, October
2. Xiaoming Liu, Tim Brody et al. (2002) “A Scalable Architecture for Harvest-Based Digital Libraries”. D-Lib Magazine, Vol. 8, No. 11, November
4. Celestial http://celestial.eprints.org/
5. Citebase Search http://citebase.eprints.org/
7. Open Citation Project http://opcit.eprints.org/
8. Open Archives Initiative http://www.openarchives.org/
Carl Lagoze, William Arms, Stoney Gan, Diane Hillmann, Christopher Ingram, Dean Krafft, Richard Marisa, Jon Phipps, John Saylor, Carol Terrizzi, Walter Hoehn, David Millman, James Allan, Sergio Guzman-Lara, Tom Kalt (2002) “Core Services in the Architecture of the National Digital Library for Science Education (NSDL)” Proceedings of the second ACM/IEEE-CS joint conference on Digital libraries 201-209 http://arxiv.org/abs/cs.DL/0201025
15. AKT reference http://www.hyphen.info/