Interoperable Research Archives for Both Papers and Their Data:
infrastructure for all users of scientific research
Summary: There is a growing
number of online document archives in which researchers archive their research papers (e.g. Berkeley,
SUNY, Michigan, Cornell, Harvard, Virginia). There also exist large data
archives in many disciplines that
archive the data gathered and analysed during the course of research (e.g.
ESRC UKDA, AHDS).
Recent years have seen great progress towards
global interoperability between archives of research reports. For example, the Open Archives Initiative (OAI) at
Cornell is developing interoperability standards for the research literature.
The OPSIS project at Southampton (http://www.eprints.org/)
is applying the OAI standards in archive-creating software used by universities
and research institutions world-wide, so that their refereed research output can be harvested into a global “virtual archive” accessible
to anyone, for free (http://www.soros.org/openaccess). The OpCit project builds on this foundation,
citation-linking archives for navigation and providing scientometric analysis services to
help enrich the data provided by the original archives.
However, published documents are just one of
the end products of research which it is beneficial to share between researchers.
Other research results such as empirical data, both raw and analysed, are
also recognised as being of use to researchers other than those involved in
their creation. Furthermore, other stakeholders in the scientific community,
such as the funding bodies, institution management and the public at large,
have a need to cross-reference this research output with information concerning
the research projects themselves, such as their objectives, timescales, institutions,
investigators and funders.
In each of these sets of information, there
is great potential benefit, not only from integrated access to the end product
- the paper, the analysed data, the project report - but also from access to the intermediate
information collected along the way. The ability to collect and analyse all
information related to a research programme would benefit from an "end-to-end"
integration of pertinent e-information. (Imagine all information pertinent
to a university department’s research assessment submission being generated
by a uniform procedure at the touch of a button.)
All of this is part of the cumulative, collaborative
and public process of doing science. There are three classes of information where this
interoperability could be achieved:
There are two limitations with these services:
Both the documents and the data
must be archived at a central site where metadata or bibliographic indexes
have to be created by skilled specialists, consuming human and financial resources
at the central site.
These systems do not support
the continuous process of doing science (from conception, through proposals,
project activity to formal publication, and beyond).
This is a proposal to extend an existing self-archiving
tool (Eprints), which currently addresses the first problem for
research documents, to research data sets (as well as documents generated
in the course of doing science). The existing tool (the Eprints software,
developed by Southampton University) and the OAI protocols it uses can be
extended to address scientific data. The resulting demonstrator will be made
publicly available for evaluation. It will be used to determine the feasibility
of further extending the tool to address document and data objects for the
entire scientific research continuum.
Publication (pre-referee/final drafts) - a continuum between pre-refereed
research reports and post-refereed final drafts of research.
For example, CLRC hopes to link together
information related to all aspects of an experiment, from the application
for beamtime on the synchrotron, through the instrument parameters for the
run, the raw experimental data, the archived processed data, and the reports
and papers from the experiment.
Furthermore, the extant body of scientific
information in each of these classes is increasing rapidly with time. So
is the degree to which scientists need to collaborate in gathering, storing
and analysing that resource. Today, the Internet is the natural way of sharing
and collaborating with smaller-scale data and analysis but the size and complexity
of the data-sets that can be shared and analysed online is growing rapidly.
With the development of the Grid computational infrastructure underway, now is the appropriate
strategic time to develop the information management infrastructure that will exploit that resource to
open up access to the enormous corpus of scientific information as it expands.
The Open Archives Initiative (OAI),
based at Cornell University and supported by two consortia (Digital Library
Federation, Coalition for Networked Information), provides the Open Archives
Metadata Harvesting Protocol, which runs through web servers and clients to
connect Data Providers to Service Providers. The data provider is somebody who
archives information on their site, while the service provider runs the OAI
protocol to access the metadata. The OAI provides web pages for data providers
and service providers to register so that they can know of each other's existence,
and thereby bring about interoperable access.
The OAI protocol was originally designed for
e-prints, although OAI acknowledge that it needs to be extended to cover other
forms of digital information. The OAI protocol demands that archives use
the Dublin Core metadata format, although parallel sets of metadata in other
formats are not prohibited.
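The harvesting exchange described above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the Eprints software: the base URL `http://archive.example.org/oai` and the simplified sample record are assumptions, and a real response wraps each record in further protocol elements. The `ListRecords` verb, the `oai_dc` metadata prefix and the Dublin Core namespace are taken from the published protocol.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set used by the oai_dc format.
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the request a service provider sends to a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def dublin_core_fields(record_xml):
    """Collect the Dublin Core elements of one record into a dict of lists."""
    fields = {}
    for element in ET.fromstring(record_xml).iter():
        if element.tag.startswith("{" + DC_NS + "}"):
            name = element.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


# Simplified, hypothetical record; a real response carries more structure.
SAMPLE_RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example e-print</dc:title>
  <dc:creator>Researcher, A.</dc:creator>
  <dc:creator>Researcher, B.</dc:creator>
  <dc:type>Preprint</dc:type>
</record>"""

if __name__ == "__main__":
    print(list_records_url("http://archive.example.org/oai"))
    print(dublin_core_fields(SAMPLE_RECORD))
```

Because every Dublin Core element is optional and repeatable, the parser keeps a list of values per element rather than a single string.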
The main commercial and public domain alternative
to the OAI protocol is the ANSI/NISO Z39.50 protocol, which is widely used by large archives.
Z39.50 has been available in implementations since the early 1990s, whereas
OAI was only released in January 2001, so it is considerably more recent,
although a revised version 2.0 is expected in the second quarter of 2002.
The contrast between the two is that Z39.50 supports greater functionality
than OAI and therefore is more complex to implement in the HTTP server and
The Eprints software is one implementation of
an OAI-conformant server, written in Perl for the UNIX operating
system. It draws on a set of freely and publicly available tools (MySQL, Apache,
etc.). Eprints V2 was released on Feb 14th 2002, providing improved
installation and database text indexing and normalising. The Eprints software
has been taken up by a growing
number of sites worldwide that have produced substantial local archives
as document data providers.
The limitation of OAI lies in the restriction of
the metadata representation to Dublin Core only. The ePrint software uses
a more complex nested metadata structure to represent archived items internally.
This is accessible for search from a web page on the archive machine itself.
However, the interoperable service provided to OAI data services only supports
the Dublin Core subset of this.
The main alterations expected to the ePrint
system would be:
Empirical Data (including metadata about the method etc) - from raw
experimental output to analysed data to condensed, extracted, summarised and
Funding & Project - from grant application, through data and reports
through to final reports and project reviews.
Improve the documentation, which
at present is minimal.
The ePrints and self-archiving initiative
does not undertake the filtering
function of existing libraries and archives, nor their indexing
function or preservation
function. These will be addressed by complementary digital library initiatives.
The document archiving facilities of the Eprints
software, developed by Southampton University, can now be extended to provide
storage for raw scientific data as well as the capability of interoperable
processing. There is already momentum for widespread adoption of the Eprints
software by universities and research institutions worldwide for research
report archiving. It is natural to draw upon that momentum for this generalization
of the initiative to data archiving. It is also natural for the UK to consolidate
its lead in the overall digital archiving area in this new and important way.
Improve the installation procedures,
which at present require a skilled UNIX administrator to install the software.
This should be done using one of the commercial installation packages such
as InstallAnywhere. These should be able to be performed by an experienced
The structure of the metadata
representation used currently cannot be changed in the time proposed for
the project. It will therefore be necessary to map the existing document
metadata structure to the hierarchical structure used for data metadata.
A new metadata format needs to
be defined within the ePrints system to support this mapping.
From A Two-Phase Publication
Model to a Continuous Dissemination Model
Figure 1: Open Archiving Article Publication
The influence of the Internet, and latterly
the World Wide Web, has been both to open up and speed up the publication
process. Electronic journals can turn over their submissions more quickly
than previously, while the opportunities for institutional self-archiving
have allowed both un-refereed pre-prints and refereed post-prints (collectively
known as e-prints) to be widely and instantaneously disseminated. Studies
of the Los Alamos High Energy Preprints archive show that most articles are
read within a week of their initial deposit and begin to be cited in new
articles a month later. Each article
has a number of revisions which correct mistakes in earlier versions or address
issues raised by referees, leading to a publication process as shown in Figure
1 and a cycle of interacting articles in Figure 2.
The work with Open Archives highlights the fact that articles cannot be considered
as ‘point-like events’ which suddenly come into being. Instead, they have
to be seen as part of a larger continuum
of scientific research and communication. This process has consisted
of the loosely synchronised reporting, reading, evaluating, publishing and
citation of research results. Scientific dissemination occurs within communities
of practice, with conference, workshop and seminar communication supplementing
the formal modality of communication: journal publication. The Internet provided
an informal method for trading preprints; the HEP Archive grew from such
beginnings into a major piece of research infrastructure with significant
journals built atop it [JHEP].
This model extends the process of publication
into a continuum of dissemination; however, it begins arbitrarily at the point
at which the first draft of a
report is publicly deposited in an archive. In reality, the scientific process consists of a continuous
cycle of reading, experimentation, authorship, review and publication [OSP].
The purpose of experimentation is to formulate and test the truth of hypotheses;
the purpose of authorship is to report the results of the experimental process
in a written form; the purpose of review is to test the quality of the findings
and the purpose of publication is to disseminate those findings.
In order to accommodate a fuller model of scientific
dissemination than the hitherto report-oriented publication cycle, we need
to be able to deal with experimental data.
Some of this capability is already in place,
but further interoperability needs to be designed for information that does
not consist only of text objects or multimedia sound/images, but arbitrary
raw data. The existing metadata structure supports ePrints that consist of
a set of documents which in turn consist of sets of e-print files (e.g. an
html document can consist of a set of files). The standard scientific data
metadata that CLRC use on their Data Portal has a structure built around a
study which consists of a set of experiments, each of which consists of a
set of data files. It is clear that a mapping exists between the two three-layered
structures that can be used to represent experimental studies with
the existing technology. Publishing models that are not based on scientific
communication, as ePrints is, and do not take into account revisions
and multiple documents within the first-order entity could not support this.
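The correspondence between the two three-layered structures can be made concrete with a small Python sketch. The class and field names below are hypothetical illustrations, not the actual schema of the Eprints software or of the CLRC Data Portal. Only the layering (study, experiment, data file on one side; e-print, document, file on the other) comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List


# Data side: a study holds experiments, each of which holds data files.
@dataclass
class DataFile:
    name: str


@dataclass
class Experiment:
    title: str
    data_files: List[DataFile] = field(default_factory=list)


@dataclass
class Study:
    title: str
    experiments: List[Experiment] = field(default_factory=list)


# Document side: an e-print holds documents, each of which holds files.
@dataclass
class Document:
    description: str
    files: List[str] = field(default_factory=list)


@dataclass
class EPrintRecord:
    title: str
    documents: List[Document] = field(default_factory=list)


def study_to_eprint(study: Study) -> EPrintRecord:
    """Map each layer of the study onto the matching e-print layer."""
    return EPrintRecord(
        title=study.title,
        documents=[
            Document(
                description=experiment.title,
                files=[data_file.name for data_file in experiment.data_files],
            )
            for experiment in study.experiments
        ],
    )


if __name__ == "__main__":
    study = Study(
        title="Example study",
        experiments=[Experiment("Run 1", [DataFile("raw.dat"), DataFile("processed.dat")])],
    )
    print(study_to_eprint(study))
```

The sketch maps layer to layer with no loss of structure, which is the property that makes reuse of the existing document model plausible.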
To implement this new metadata structure it
will be necessary to implement:
new strings bundle files for the
new metadata input and browse facilities on the server
Key to the success of such initiatives is uptake. It is essential that installation
should be as simple as possible and maintenance should be minimal. Software
will be open-source and free (thus requiring central funding). A model is needed, both for University Provosts/Presidents
and for University Libraries, to show them how they can implement and facilitate
the self-archiving of all their refereed research documents as well as their
OAI-compliant document and data providers will
also require service clients. An objective of the project will accordingly
be not only to provide a demonstrator service, but a client that can be used
to access the service. The University of Southampton has such a client as part
of the OpCit project.
a new mapping from the data metadata
to Dublin Core for OAI access
new metadata fields for the representation
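A mapping from data metadata to Dublin Core for OAI access might look like the following Python sketch. The study field names (`study_name`, `investigators` and so on) are hypothetical and are not drawn from any existing schema. The Dublin Core element names and the `Dataset` type value come from the Dublin Core standards. Fields with no Dublin Core equivalent are simply dropped, which illustrates why the OAI view exposes only a subset of the richer internal metadata.

```python
# Hypothetical study-level fields mapped to Dublin Core element names.
STUDY_TO_DC = {
    "study_name": "title",
    "investigators": "creator",
    "institution": "publisher",
    "abstract": "description",
    "start_date": "date",
}


def to_dublin_core(study_metadata):
    """Project study metadata onto Dublin Core; unmapped fields are dropped."""
    record = {"type": ["Dataset"]}  # value from the DCMI Type Vocabulary
    for field_name, value in study_metadata.items():
        dc_element = STUDY_TO_DC.get(field_name)
        if dc_element is None:
            continue  # no Dublin Core equivalent, so absent from the OAI view
        values = value if isinstance(value, list) else [value]
        record.setdefault(dc_element, []).extend(values)
    return record


if __name__ == "__main__":
    print(to_dublin_core({
        "study_name": "Example beamline study",
        "investigators": ["A. Researcher", "B. Researcher"],
        "instrument_parameters": {"energy": 1},  # has no Dublin Core element
    }))
```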
The long-term aim is the integrated management
of all e-information relevant to the scientific research community. The immediate
objective is to provide a tool to allow the UK science community to archive
their own data and documents, making them available for secondary analysis.
The proposed project will focus on the key objectives described below.
Analyse the requirements for protocols,
indexing methods, brokers, indexing tools and search tools to extend Eprints
to address data as well as documents.
Requirements. The experts in scientific data structure
with experience in developing a data portal for e-science at CLRC will work
with the developers of the Eprints tool at the University of Southampton
to draft a set of requirements for the open archiving of scientific data
and documents throughout the science process. This will be based upon a business
model of the scientific process, as well as on the protocols used by existing
archives, the needs for authentication and authorisation, and planned UK
initiatives such as DNER. The metadata
structure required to support scientific data will be specified, providing
an initial overall design of the extension to Eprints required to incorporate
Design and implement extensions
to the Eprints tool to archive data.
Document the ePrints tool.
Improve the installation procedures
for the ePrints tool.
Develop a client tool using OAI
protocol to allow access to the distributed ePrints network.
Design and implement a simple
portal that allows users to acquire the new tool and to register their archive,
and to support the evaluation of the approach.
Evaluate the effectiveness (recall
and precision), efficiency (time, costs and performance demands), usability,
and learnability of the portal and tool.
Produce a feasibility report on
further developing this architecture to scale up to the whole science data
and documentation cycle, integrating the portal with existing archives and
resource discovery services.
Eprints Tool Design and Extension. The UI of the Eprints tool and
its supporting functionality will be extended to incorporate data as well
as documents, by modifying the metadata representation and the strings files
for the UI, as well as the mapping to Dublin Core for the data metadata.
and installation procedures for e-Print will be improved and made more robust
to enable a wider public audience to adopt it by using one of the standard
installation tools such as InstallAnywhere. The documentation of the ePrints
tool will be extended and made more usable and useful. To develop a client
tool that would allow access to the ePrints system using OAI protocols as
Demonstrator Portal Design and Implementation.
A portal will be designed and implemented to allow the downloading
of the Eprints tool, the registering of archives created with it, and linking
to existing science archives. This portal will serve as a populated demonstrator
of the technology.
Evaluation. The tool will be
made available to the public through the portal for download and use. A
detailed diagnostic evaluation will also be made of its use by scientists
at the CLRC RAL. The third form of evaluation will address the effectiveness
(recall and precision), efficiency (time, costs and performance demands),
usability and learnability of the portal and tool.
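The effectiveness measures named above have standard definitions, shown here as a minimal Python sketch. The function name and the sample identifiers are illustrative assumptions. Precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search against known relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Four records returned by a search, three known to be relevant.
    precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example two of the four retrieved records are relevant (precision 0.5) and two of the three relevant records were found (recall about 0.67).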
Scalability. On the basis of the requirements, and the evaluation
of the tool and portal, scalability estimates to new protocols, data volumes,
etc. will be calculated to assess the feasibility of further developing the
approach and integrating it with DNER and the DataGRID. Scalability to a
wider solution would involve extending the protocol set beyond OAI to include
Z39.50, Web Services protocols such as UDDI, SOAP and WSDL, and possibly Globus
protocols. This extended protocol set would have to be planned for in the
Project management. Management at the University of Southampton will be by senior academic
staff who are paid for research as part of their normal contracts. Project
management at CLRC requires staff to be charged to projects. Therefore one
person month of effort will be used over the six months of the project.
Centralized eprint archives like the Physics ArXiv (mirrored at http://xxx.soton.ac.uk)
CogPrints (Cognitive Sciences) Archive
OpSIS ["Open-Sourcing Institutional Self-archiving
software"].) grant is from DNER
(sub)-Initiative of OAI. Universities
need to mandate online CVs with links to all their researchers' refereed
papers in the University's Eprint Archives; and Libraries need to stand ready
to "self-archive" their own researchers' papers for them by proxy:
http://cav2001.library.caltech.edu/permission.html for details Librarians
handle most document-management and metadata issues. For the repositories
we just registered, this work fell primarily on Kim Douglas (Director of the
Sherman Fairchild Library and Head of Technical Information Services) and
Hema Ramachandran (Reference Librarian). However, all librarians are getting
involved as they are signing up new options.
Kim worked with Adam Cochran (Caltech's Intellectual
Property Counsel) on copyright issues. Authors who deposit material in our
repositories retain copyright, but grant a non-exclusive, royalty-free license
to Caltech. (See Of course, Caltech makes this material available for free.
(See http://cav2001.library.caltech.edu/copyright.html for details.) The
technical aspects are taken care of by the Library Information Technology
group members Ed Sponsler and Betsy Coles. Ed focuses on Eprints and Betsy
on ETDs. Both work closely together, and both support our librarians as they
learn to conquer all the document-management issues related to these projects.
ESRC UK data archive (University of Essex) -
similar in other countries through CESSDA – Common DDI metadata standard in
XML – the Common Data Documentation Initiative (DDI). Use NESSTAR as common
access tool (client and servers).
Art and Humanities Data Service (gateway to
UK arts archives) – King’s London.
RDN - Resource Discovery Network, JISC UK network
including Athens Authentication and Authorisation server
DNER - JISC planned replacement for RDN including