Distributed Interoperable Research Archives
- an electronic infrastructure for all users of scientific research output
Keith Jeffery, CLRC
Stevan Harnad, University of Southampton
Les Carr, University of Southampton
Wendy Hall, University of Southampton
Distributed Interoperable Research Archives
- a Proposal for an e-Science Infrastructure
There is a growing number of online document archives in which researchers archive their research papers (e.g. Berkeley, SUNY, Michigan, Cornell, Harvard, Virginia). There also exist large data archives in many disciplines that archive the data gathered and analysed during the course of research (e.g. ESRC UKDA, AHDS).
There are two limitations with these services:
1) Both the documents and the data must be archived at a central site, where metadata or bibliographic indexes have to be created by skilled specialists, consuming human and financial resources at that site.
2) These systems do not support the continuous process of doing science (from conception, through proposals, project activity to formal publication, and beyond).
This is a proposal to extend an existing self-archiving tool (Eprints), which currently addresses the first problem for research documents, to research data sets (as well as documents generated in the course of doing science). The existing tool (the Eprints software, developed by the University of Southampton) and the OAI protocols it uses can be extended to address scientific data. The resulting demonstrator will be made publicly available for evaluation. It will be used to determine the feasibility of further extending the tool to address document and data objects for the entire scientific research continuum.
Recent years have seen great progress towards global interoperability between archives of research reports. For example, the Open Archives Initiative (OAI) at Cornell is developing interoperability standards for the research literature. The OpSIS project at Southampton (http://www.eprints.org/) is applying the OAI standards in archive-creating software used by universities and research institutions world-wide, so that their refereed research output can be harvested into a global “virtual archive” accessible to anyone, for free (http://www.soros.org/openaccess). The OpCit project builds on this foundation, citation-linking archives for navigation and providing scientometric analysis services to help enrich the data provided by the original archives.
However, published documents are just one of the end products of research that it is beneficial to share between researchers. Other research results such as empirical data, both raw and analysed, are also recognised as being of use to researchers other than those involved in their creation. Furthermore, other stakeholders in the scientific community, such as the funding bodies, institution management and the public at large, need to cross-reference this research output with information concerning the research projects themselves, such as their objectives, timescales, institutions, investigators and funders.
In each of these sets of information, there is great potential benefit, not only from integrated access to the end product - the paper, the analysed data, the project report - but also from access to the intermediate information collected along the way. The ability to collect and analyse all information related to a research programme would benefit from an "end-to-end" integration of pertinent e-information. (Imagine all information pertinent to a university department’s research assessment submission being generated by a uniform procedure at the touch of a button.)
All of this is part of the cumulative, collaborative and public process of doing science.
There are three classes of information where this interoperability could be achieved:
1. Publication (pre-referee/final drafts) - a continuum from pre-refereed research reports to post-refereed final drafts of research; also library services
2. Empirical Data (including metadata about the method etc) - from raw experimental output to analysed data to condensed, extracted, summarised and published reports
3. Funding & Project - from grant application, through data and reports through to final reports and project reviews.
For example, CLRC hopes to link together information related to all aspects of an experiment, from the application for beamtime on the synchrotron, through the instrument parameters for the run, the raw experimental data, the archived processed data, and the reports and papers from the experiment.
Furthermore, the extant body of scientific information in each of these classes is increasing rapidly with time. So is the degree to which scientists need to collaborate in gathering, storing and analysing that resource. Today, the Internet is the natural way of sharing and collaborating with smaller-scale data and analysis but the size and complexity of the data-sets that can be shared and analysed online is growing rapidly.
With the development of the Grid computational infrastructure underway, now is the appropriate strategic time to develop the information management infrastructure that will exploit that resource to open up access to the enormous corpus of scientific information as it expands.
OAI, based at Cornell University and supported by two consortia (the Digital Library Federation and the Coalition for Networked Information), provides the Open Archives Metadata Harvesting protocol, which runs through web servers and clients to connect Data Providers to Data Services. The data provider is an individual or institution that archives information on its own site, while the service provider (the data service) runs the OAI protocol to access the metadata. The OAI provides web pages where data providers and service providers can register, so that they know of each other's existence and can thereby bring about interoperable access.
The OAI protocol was originally designed for e-prints, although OAI acknowledge that it needs to be extended to cover other forms of digital information. The OAI protocol demands that archives use the Dublin Core metadata format, although parallel sets of metadata in other formats are not prohibited.
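The harvesting exchange described above can be illustrated with a minimal sketch. It assumes the request parameters (`verb=ListRecords`, `metadataPrefix=oai_dc`) and XML namespaces of version 2.0 of the protocol as later published; earlier versions may differ in detail. The base URL, the sample response and the function names are hypothetical and serve only as illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL a service provider would request from a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_titles(response_xml):
    """Return every Dublin Core title found in a harvested response."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.iter(f"{{{DC_NS}}}title")]


# Hand-written stand-in for a data provider's reply (not real archive output).
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><record><metadata>"
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Example deposited paper</dc:title>"
    "<dc:creator>Hypothetical, A.</dc:creator>"
    "</oai_dc:dc></metadata></record></ListRecords></OAI-PMH>"
)

if __name__ == "__main__":
    # The host name below is a placeholder, not a registered archive.
    print(list_records_url("http://archive.example.org/oai"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real service provider would fetch the URL over HTTP and handle paging and errors; the sketch only shows that the data provider exposes metadata, not the archived files themselves.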
The main commercial and public domain alternative to the OAI protocol is ANSI/NISO Z39.50, which is widely used by large archives. Implementations of Z39.50 have been available since the early 1990s, whereas OAI was first released in January 2001 and is therefore considerably more recent; a revised version 2.0 is expected in the second quarter of 2002. The contrast between the two is that Z39.50 supports greater functionality than OAI and is therefore more complex to implement in both the server and the client.
The Eprints software is one implementation of an OAI-conforming server, written in Perl for the UNIX operating system. It draws on a set of freely and publicly available tools (MySQL, Apache, etc.). Eprints V2 was released on Feb 14th 2002, providing improved installation and database text indexing and normalising. The Eprints software has been taken up by a growing number of sites worldwide, which have produced substantial local archives as document data providers.
The limitation of OAI is that its metadata representation is restricted to Dublin Core. The Eprints software uses a more complex nested metadata structure to represent archived items internally, and this structure can be searched from a web page on the archive machine itself. However, the interoperable service provided to OAI data services supports only the Dublin Core subset of this structure.
The main alterations expected to the Eprints system would be:
1) Improve the documentation, which at present is minimal.
2) Improve the installation procedures, which at present require a skilled UNIX administrator to install the software. This should be done using one of the commercial installation packages such as InstallAnywhere, so that installation can be performed by an experienced UNIX user.
3) The structure of the metadata representation used currently cannot be changed in the time proposed for the project. It will therefore be necessary to map the existing document metadata structure to the hierarchical structure used for data metadata.
4) A new metadata format needs to be defined within the Eprints system to support this mapping.
The Eprints and self-archiving initiative does not undertake the filtering function of existing libraries and archives, nor their indexing or preservation functions. These will be addressed by complementary digital library initiatives.
The document archiving facilities of the Eprints software, developed by Southampton University, can now be extended to provide storage for raw scientific data as well as the capability of interoperable processing. There is already momentum for widespread adoption of the Eprints software by universities and research institutions worldwide for research report archiving. It is natural to draw upon that momentum for this generalization of the initiative to data archiving. It is also natural for the UK to consolidate its lead in the overall digital archiving area in this new and important way.
Figure 1: Open Archiving Article Publication Process
The influence of the Internet, and latterly the World Wide Web, has been both to open up and speed up the publication process. Electronic journals can turn over their submissions more quickly than previously, while the opportunities for institutional self-archiving have allowed both un-refereed pre-prints and refereed post-prints (collectively known as e-prints) to be widely and instantaneously disseminated. Studies of the Los Alamos High Energy Preprints archive show that most articles are read within a week of their initial deposit and begin to be cited in new articles a month later. Each article has a number of revisions which correct mistakes in earlier versions or address issues raised by referees, leading to a publication process as shown in Figure 1 and a cycle of interacting articles in Figure 2.
Figure 2: Open Archiving Article Publication Cycle
The work with Open Archives highlights the fact that articles cannot be considered as ‘point-like events’ which suddenly come into being. Instead, they have to be seen as part of a larger continuum of scientific research and communication. This process has consisted of the loosely synchronised reporting, reading, evaluating, publishing and citation of research results. Scientific dissemination occurs as communities of practice, with conference, workshop and seminar communication supplementing the formal modality of communication: journal publication. The Internet provided an informal method for trading preprints; the HEP Archive grew from such beginnings into a major piece of research infrastructure with significant journals built atop it [JHEP].
This model extends the process of publication into a continuum of dissemination; however, it begins arbitrarily at the point at which the first draft of a report is publicly deposited in an archive. In reality, the scientific process consists of a continuous cycle of reading, experimentation, authorship, review and publication [OSP]. The purpose of experimentation is to formulate and test the truth of hypotheses; the purpose of authorship is to report the results of the experimental process in a written form; the purpose of review is to test the quality of the findings; and the purpose of publication is to disseminate those findings.
In order to accommodate a fuller model of scientific dissemination than the hitherto report-oriented publication cycle, we need to be able to deal with experimental data.
Some of this capability is already in place, but further interoperability needs to be designed for information that consists not only of text objects or multimedia sound/images, but also of arbitrary raw data. The existing metadata structure supports e-prints that consist of a set of documents, which in turn consist of sets of files (e.g. an HTML document can consist of a set of files). The standard scientific data metadata that CLRC use on their Data Portal has a structure built around a study, which consists of a set of experiments, each of which consists of a set of data files. It is clear that a mapping exists between the two three-layered structures, and that this mapping can be used to represent experimental studies with the existing technology. Publishing models that are not based on scientific communication as Eprints is, and that do not take into account revisions and multiple documents within the first-order entity, could not support this mapping.
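The correspondence between the two three-layered structures can be sketched as follows. This is an illustration only: the class and field names are hypothetical and are not taken from the Eprints software or the CLRC Data Portal schema. The sketch assumes the simplest possible alignment (study to e-print, experiment to document, data file to file).

```python
from dataclasses import dataclass, field
from typing import List


# Data-side hierarchy: study -> experiments -> data files.
@dataclass
class DataFile:
    name: str


@dataclass
class Experiment:
    title: str
    data_files: List[DataFile] = field(default_factory=list)


@dataclass
class Study:
    title: str
    experiments: List[Experiment] = field(default_factory=list)


# Document-side hierarchy: e-print -> documents -> files.
@dataclass
class EprintDocument:
    description: str
    files: List[str] = field(default_factory=list)


@dataclass
class Eprint:
    title: str
    documents: List[EprintDocument] = field(default_factory=list)


def study_to_eprint(study: Study) -> Eprint:
    """Map each layer of a study onto the matching layer of an e-print."""
    return Eprint(
        title=study.title,
        documents=[
            EprintDocument(
                description=experiment.title,
                files=[data_file.name for data_file in experiment.data_files],
            )
            for experiment in study.experiments
        ],
    )


if __name__ == "__main__":
    study = Study(
        title="Hypothetical study",
        experiments=[Experiment("Run 1", [DataFile("raw.dat"), DataFile("reduced.dat")])],
    )
    print(study_to_eprint(study))
```

Because both hierarchies have exactly three levels, the mapping needs no restructuring, only a renaming of levels; any metadata specific to data (instrument parameters, for instance) would still need new fields.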
To implement this new metadata structure it will be necessary to implement:
1) new strings bundle files for the new metadata input and browse facilities on the server
2) a new mapping from the data metadata to Dublin Core for OAI access
3) new metadata fields for the representation
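As a rough illustration of the second item, the sketch below flattens a nested data-metadata record into unqualified Dublin Core elements. The input field names (`investigators`, `experiments`, `data_files`) and the choice of target elements are assumptions made for this example, not the mapping the project would adopt; only the Dublin Core element names themselves (`title`, `creator`, `type`, `description`, `identifier`) are standard.

```python
def study_to_dublin_core(study):
    """Flatten a nested study record into repeatable Dublin Core elements.

    Each Dublin Core element maps to a list, since elements may repeat.
    """
    dublin_core = {
        "title": [study["title"]],
        "creator": list(study.get("investigators", [])),
        "type": ["Dataset"],  # term from the DCMI Type Vocabulary
        "description": [],
        "identifier": [],
    }
    for experiment in study.get("experiments", []):
        # The nesting is lost here: experiments become descriptions and
        # all data files end up in one flat list of identifiers.
        dublin_core["description"].append(experiment["title"])
        dublin_core["identifier"].extend(experiment.get("data_files", []))
    return dublin_core


if __name__ == "__main__":
    record = {
        "title": "Hypothetical study",
        "investigators": ["Hypothetical, A."],
        "experiments": [
            {"title": "Run 1", "data_files": ["raw.dat", "reduced.dat"]},
        ],
    }
    print(study_to_dublin_core(record))
```

The sketch makes the cost of the mapping visible: a service provider harvesting only Dublin Core can no longer tell which data file belongs to which experiment, which is why the richer internal structure remains necessary for local search.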
Critical to the success of such initiatives is uptake. It is essential that installation should be as simple as possible and that maintenance should be minimal. Software will be open-source and free (thus requiring central funding). A model is needed, both for University Provosts/Presidents and for University Libraries, to show them how they can implement and facilitate the self-archiving of all their refereed research documents as well as their data.
OAI-compliant document and data providers will also require service clients. An objective of the project will accordingly be not only to provide a demonstrator service, but a client that can be used to access the service. University of Southampton have such a client as part of the OpCit project.
The long-term aim is the integrated management of all e-information relevant to the scientific research community. The immediate objective is to provide a tool to allow the UK science community to archive their own data and documents, making them available for secondary analysis. The proposed project will focus on the key objectives described below.
1. Analyse the requirements for protocols, indexing methods, brokers, indexing tools and search tools to extend Eprints to address data as well as documents.
2. Design and implement extensions to the Eprints tool to archive data.
3. Document the Eprints tool.
4. Improve the installation procedures for the Eprints tool.
5. Develop a client tool using the OAI protocol to allow access to the distributed Eprints network.
6. Design and implement a simple portal that allows users to acquire the new tool and to register their archive, and to support the evaluation of the approach.
7. Evaluate the effectiveness (recall and precision), efficiency (time, costs and performance demands), usability, and learnability of the portal and tool.
8. Produce a feasibility report on further developing this architecture to scale up to the whole science data and documentation cycle, integrating the portal with existing archives and resource discovery services.
Figure 4: Project Dependencies
The experts in scientific data structure with experience in developing a data portal for e-science at CLRC will work with the developers of the Eprints tool at the University of Southampton to draft a set of requirements for the open archiving of scientific data and documents throughout the science process. This will be based upon a business model of the scientific process, as well as on the protocols used by existing archives, the needs for authentication and authorisation, and planned UK initiatives such as DNER.
The metadata structure required to support scientific data will be specified, providing an initial overall design of the extension to Eprints required to incorporate it.
The UI of the Eprints tool and its supporting functionality will be extended to incorporate data as well as documents, by modifying the metadata representation and the strings files for the UI, as well as the mapping to Dublin Core for the data metadata.
The delivery and installation procedures for Eprints will be improved and made more robust, using one of the standard installation tools such as InstallAnywhere, to enable a wider public audience to adopt it.
The documentation of the Eprints tool will be extended and made more usable and useful.
A client tool will be developed that allows access to the Eprints system as a service using OAI protocols.
A portal will be designed and implemented to allow the downloading of the Eprints tool, the registering of archives created with it, and linking to existing science archives. This portal will serve as a populated demonstrator of the technology.
The tool will be made available to the public through the portal for download and use. A detailed diagnostic evaluation will also be made of its use by scientists at the CLRC RAL. The third form of evaluation will address the effectiveness (recall and precision), efficiency (time, costs and performance demands), usability and learnability of the portal and tool.
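The effectiveness measures named above have standard definitions: precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that are retrieved. A minimal sketch of how they might be computed for a single search follows; the function name and the convention of returning 0.0 for an empty set are choices made for this example, not part of the evaluation plan.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    retrieved: identifiers returned by the search tool
    relevant:  identifiers judged relevant by an assessor
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against division by zero when either set is empty.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Four records returned, three judged relevant, two in common.
    print(precision_recall(["a", "b", "c", "d"], ["a", "b", "e"]))
```

In the example, two of the four retrieved records are relevant (precision 0.5) and two of the three relevant records were found (recall 2/3).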
On the basis of the requirements, and the evaluation of the tool and portal, scalability estimates for new protocols, data volumes, etc. will be calculated to assess the feasibility of further developing the approach and integrating it with DNER and the DataGRID. Scalability to a wider solution would involve extending the protocol set beyond OAI to include Z39.50, Web Services protocols such as UDDI, SOAP and WSDL, and possibly Globus protocols. This extended protocol set would have to be accounted for in the scalability plans.
Management at the University of Southampton will be carried out by senior academic staff who are paid for research as part of their normal contracts. Project management at CLRC requires staff to be charged to projects. Therefore one person-month of effort will be used for project management at CLRC over the six months of the project.
(1) Southampton - Stevan Harnad, Les Carr, Wendy Hall
IAM (Intelligence, Agents, Multimedia)
Department of Electronics & Computer Science,
University of Southampton,
SO17 1BJ, UK
(2) CLRC-RAL - Keith Jeffery, Juan Bicarregui
Business and Information Technology Department
CLRC Rutherford Appleton Laboratory
Oxon, OX11 0QX
Tel: 01235 44 6619
Fax: 01235 44 5831
The Central Laboratory of the Research Councils (CLRC) is a UK government laboratory located at three sites in the UK, of which the Rutherford Appleton Laboratory (RAL) is the largest (http://www.clrc.ac.uk/). CLRC employs 1800 staff, of whom 1200 are located at RAL. CLRC had an annual turnover in 1997/98 of £91.9M, of which 5% came in grants from the EU and 10% from commercial contracts, with the remainder from UK Government research grants.
CLRC undertakes research in a wide range of disciplines including Space Science (with ESA and NASA), Laser Technology, Microfabrication, Particle Physics (with CERN), synchrotron radiation, neutron & muon sources to investigate materials, medium energy ion scattering to study metal surfaces, radio propagation, and IT.
The Business and Information Technology Department employs about 120 staff researching and developing IT, and has considerable experience in EU projects funded under the Esprit II, III, IV, Telematics, RACE and ACTS programmes (http://www.bitd.clrc.ac.uk/). The UK Office of W3C, the World Wide Web Consortium, is also housed within the IT department of CLRC, through which we have been actively involved in guiding the development of the web and the technologies it employs.
CLRC BITD has experience of developing web portals for e-science such as the data portal (http://www.escience.clrc.ac.uk/Activity/ACTIVITY=DataPortal;), and considerable experience in metadata for scientific data and information services.
centralized eprint archives like the Los Alamos National Lab Physics Archive
LANL Archive (mirrored at http://xxx.soton.ac.uk)
CogPrints (Cognitive Sciences) Archive (http://cogprints.soton.ac.uk).
OpSIS ("Open-Sourcing Institutional Self-archiving Software"): grant is from DNER
(sub)-Initiative of OAI! Universities need to mandate online CVs with links to all their researchers' refereed papers in the University's Eprint Archives; and Libraries need to stand ready to "self-archive" their own researchers' papers for them by proxy, as
CalTech: Librarians handle most document-management and metadata issues. For the repositories we just registered, this work fell primarily on Kim Douglas (Director of the Sherman Fairchild Library and Head of Technical Information Services) and Hema Ramachandran (Reference Librarian). However, all librarians are getting involved as they are signing up new options.
Kim worked with Adam Cochran (Caltech's Intellectual Property Counsel) on copyright issues. Authors who deposit material in our repositories retain copyright, but grant a non-exclusive, royalty-free license to Caltech. (See http://cav2001.library.caltech.edu/permission.html for details.) Of course, Caltech makes this material available for free. (See http://cav2001.library.caltech.edu/copyright.html for details.) The technical aspects are taken care of by the Library Information Technology group members Ed Sponsler and Betsy Coles. Ed focuses on Eprints and Betsy on ETDs. Both work closely together, and both support our librarians as they learn to conquer all the document-management issues related to these projects.
ESRC UK data archive (University of Essex) - similar in other countries through CESSDA – Common DDI metadata standard in XML – the Common Data Documentation Initiative (DDI). Use NESSTAR as common access tool (client and servers).
Arts and Humanities Data Service (gateway to UK arts archives) - King’s London.
RDN - Resource Discovery Network, JISC UK network including Athens Authentication and Authorisation server
DNER - JISC planned replacement for RDN including brokers
[OSP] Issues for Science and Engineering Researchers in the Digital Age. Office of Special Projects, Policy and Global Affairs, National Research Council. National Academy Press, Washington, DC. http://www.nap.edu/html/issues_digital/index.html