Hypermedia assists the user in navigating the available information space; this space is diverse and distributed, populated by multimedia objects which may be persistent, constantly updated, and perhaps dynamically generated. Navigation may involve pre-authored links (‘buttons’) and links computed dynamically [Hall 94], but navigation techniques based on links alone have proved insufficient for the World Wide Web (WWW) and keyword-based search engines have to be used to augment navigation strategies. By indexing a large proportion of the documents on the WWW these can provide pointers to huge numbers of apparently relevant documents, although frequently with low precision measures. However, there is no standard for the use of keywords and consistent keyword descriptions are as elusive as consistent links. In a global information environment, where the number and size of potential information resources is huge, of variable quality and aimed at diverse audiences, the focussed retrieval of pertinent information becomes a vital issue.
The aim of this project is to research methods for significantly improving the quality, consistency and breadth of linking of WWW documents at retrieval time (as readers browse the documents) and at authoring time (as authors create the documents). It plans to produce a COHSE (Conceptual Open Hypermedia Services Environment) using three leading-edge technologies:
The result of the integration will be evaluated and refined with two real case study applications drawn from commercial collaborators.
Metadata is data that describes other data to enhance its usefulness. The library catalogue or database schema are canonical examples. For our purposes, metadata falls into three broad categories:
Metadata activities have been a major focus of interest for the WWW community, especially for information providers, publishers and digital libraries. The takeup of the eXtensible Markup Language (XML) has been particularly concerned with its applications for expressing data about documents, and has most recently been used to define the Resource Description Framework (RDF) [Lassila 99]. The aim of RDF is to provide a standard framework for expressing statements about data objects, especially statements giving information about authors, publishers, version and keyword information (these attributes being standardised as the Dublin Core [Thiele 98]).
Providing conceptual content-based information as the attributes of XML tags is seen as a crucial activity, enabling search engines to provide query results which are more pertinent. Currently such concepts are simple keywords. Hypermedia systems such as the Distributed Link Service [Carr 95; Carr 96] may make use of this information to provide a rudimentary "conceptual hypermedia" by clustering documents with the same tag value keyword for retrieval purposes and linking documents with the same tag value for navigation. Keywords effectively classify documents into clusters that share the same set of keywords, or variations of them if stemming is used.
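The keyword-based clustering and linking described above can be illustrated with a minimal sketch. It is not the DLS implementation; the function names, the document names and the case-insensitive matching are assumptions made purely for illustration, and stemming is omitted.

```python
from collections import defaultdict

def cluster_by_keyword(doc_keywords):
    """Map each (lower-cased) keyword to the sorted documents tagged with it."""
    clusters = defaultdict(set)
    for url, keywords in doc_keywords.items():
        for keyword in keywords:
            clusters[keyword.lower()].add(url)
    return {keyword: sorted(urls) for keyword, urls in clusters.items()}

def keyword_links(doc_keywords, url):
    """Return the documents that share at least one keyword with `url`."""
    clusters = cluster_by_keyword(doc_keywords)
    related = set()
    for keyword in doc_keywords.get(url, ()):
        related.update(clusters[keyword.lower()])
    related.discard(url)  # a document is not linked to itself
    return sorted(related)

# Hypothetical documents and keyword sets.
docs = {
    "a.html": {"hypermedia", "metadata"},
    "b.html": {"Metadata", "RDF"},
    "c.html": {"ontology"},
}
related_to_a = keyword_links(docs, "a.html")
```

Because the clusters depend only on literal keyword equality, two documents described with different but related terms are never linked, which is the weakness the rest of this section addresses.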
Consistent keyword descriptions are difficult to create and subsequently maintain, leading to an incoherent model of concepts and hence inaccurate linking, since each document will have many possible interpretations [Bruza 91]. Consequently, some communities have developed domain-specific controlled vocabularies, or terminologies, based around a thesaurus of language terms, for example The Art and Architecture Thesaurus (AAT) [Peterson 94] or WordNet [Fellbaum 98], a general language thesaurus based on semantic nets. These and others, for example the ACM Computing Classification System [Coulter 98], are built on abstract hierarchies and are used as classification schemes. However, their use as such is seriously hampered as they are largely predetermined, static and often unsound single classification hierarchies resembling "phrase books" [Bechhofer 97]. They are not based on a systematic ontology (an explicit, rigorous, declarative specification of concepts) but are principled organisations of linguistic terms or phrases that have no rigorous fixed interpretation other than that attributed to them by human readers. This lack of rigour makes them hard to browse, to use as a querying device, to check for coherency, to extend in a principled way and to reason about [Bechhofer 99].
Compositionally based terminologies are much more powerful, resembling a collection of elementary concepts assembled according to a "grammar" to form complex composite concepts that are sound. Pre- and post-co-ordinated facet thesauri attempt to control the combination of terms on document instances but in complicated ways [Aitchison 87; Lancaster 86]. To be more effective such terminologies are best represented in a knowledge representation scheme that is expressive and can intrinsically support dynamic and automatic classification of complex composite concepts based on their components; such a scheme is a Description Logic (DL). Conventional frames or semantic networks (such as those used by WordNet) do not have the logical concept subsumption and satisfiability reasoning services offered by DLs and are consequently less flexible when constructing and evolving the conceptual network and using it for retrieval.
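To make the idea of classifying composite concepts from their components more concrete, the following is a toy sketch of structural subsumption. It is not GRAIL and covers only a tiny conjunctive fragment of what a Description Logic offers; the hierarchy, the role names (`depicts`, `madeBy`) and the representation of a concept as a base term plus a set of (role, filler) criteria are all assumptions for illustration.

```python
# Hypothetical hierarchy of elementary concepts: child -> parent.
PARENTS = {
    "Photograph": "Image",
    "Image": "Thing",
    "Dog": "Animal",
    "Animal": "Thing",
}

def is_a(term, ancestor):
    """True if `term` equals `ancestor` or lies below it in PARENTS."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENTS.get(term)
    return False

def subsumes(general, specific):
    """True if every instance of `specific` must also be a `general`.

    A concept is (base, criteria), where criteria is a set of
    (role, filler) pairs that all have to hold.
    """
    g_base, g_criteria = general
    s_base, s_criteria = specific
    if not is_a(s_base, g_base):
        return False
    # Each criterion of the general concept must be met by a criterion
    # of the specific one that is at least as specific.
    return all(
        any(role == s_role and is_a(s_filler, filler)
            for s_role, s_filler in s_criteria)
        for role, filler in g_criteria
    )

image_of_animal = ("Image", {("depicts", "Animal")})
photo_of_dog = ("Photograph", {("depicts", "Dog"), ("madeBy", "Thing")})
```

Here `subsumes(image_of_animal, photo_of_dog)` holds even though nobody stated that relationship explicitly; it follows from the components, which is the property that a fixed, hand-built hierarchy lacks.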
The University of Manchester has developed a Terminology Server (an ontology server + a language terms service) based on a Description Logic, GRAIL. The STARCH project has extended this to describe and query the metadata of a collection of stock photography drawn from the Hulton-Getty collection of Getty Images. A domain-specific thesaurus constructed using a Description Logic supports the automatic management of the thesaurus through [Bechhofer 99]:
Common usage of the Web involves embedding links within documents in the HTML format; in this sense the Web can be considered a ‘closed’ hypermedia system. However, there is nothing inherent in the Web infrastructure that prevents links from being abstracted from the documents and managed separately, as is made possible by XML’s proposed XLink standard [Maler 98]. In Open Hypermedia Systems (OHS) links are first class objects, stored and managed separately from multimedia data; like documents they can be stored, transported, cached and searched, and their use can be instrumented. OHS have been well researched by the hypermedia community [Osterbye 96] and increasingly Web publishing applications adopt the open hypermedia approach [Thistlethwaite 97; Lowe 98].
The University of Southampton’s Distributed Link Service (DLS) implements an open hypermedia system above the infrastructure of the World Wide Web [Carr 95; Carr 96; Carr 98]. This provides a powerful framework to aid navigation and authoring and solves some of the issues of distributed information management [DeRoure 96]. Using an intermediary model [Barrett 98], the DLS adds links and annotations into documents as they are delivered through a proxy from the original WWW server to the ultimate client browser. It uses a number of software modules to recognise different opportunities for adding various kinds of links to the documents, creating a user-specific navigational overlay that can be used to superimpose a coherent interface to sets of unlinked or insular resources (such as the journal archives addressed by the Open Journals project [Hitchcock 98]).
To achieve the kind of diversity of association required for today’s Web applications, documents need to be linked in many dimensions based on their content. Constructing such links manually is inconsistent and error-prone [Ellis 96]. Furthermore, it obfuscates one of the chief reasons for associating documents; that their contents are similar in some way. Conceptual Hypermedia Systems (CHS) specify the hypertext structure and behaviour in terms of a well-defined conceptual schema [Nanard 91; Bruza 90; Tudhope 97]. This types documents and links, and includes some kind of conceptual domain model used to describe document content. Consequently, information about the hypertext is represented explicitly as metadata that can be reasoned over, for example using the domain model as a classification structure to classify the documents; documents that share metadata are deemed to be similar in some way. Authoring links between documents becomes an activity of authoring with concepts [Beitner 95]; concepts are linked and hence their associated documents are linked.
The University of Manchester’s TourisT prototype experimented with a conceptual hypermedia approach for a Tourism Public Access System [Bullock 98]. As the relationships between the concepts in the domain model evolve so do the links; as document concept descriptors change so do the links, making this a potentially powerful linking mechanism.
Increasingly, WWW applications adopt the open hypermedia approach of managing links as first-class objects, separate from the documents they describe. Southampton’s Distributed Link Service treats link creation and resolution as a service which may be provided by a number of link resolution engines. For example, the DLS uses resolvers which recognise keywords, names of people and bibliographic citations as potential link anchors according to different heuristics and knowledge bases. The link resolvers are hardwired into the monolithic system or chained sequentially [DeRoure 99] so that each one sees the document with links added by the previous resolver. This is an inherently synchronous arrangement, a side effect of which is to increasingly delay the eventual delivery of the linked document as the amount of processing expended in determining links increases.
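The sequential chaining of resolvers can be sketched as follows. This is a simplified illustration rather than the DLS code: the resolver functions, the in-memory linkbase, the `example.org` targets and the representation of a document as a dictionary of text plus accumulated links are all assumptions.

```python
import re

def keyword_resolver(doc):
    """Add a link for each known keyword found in the text."""
    linkbase = {"hypermedia": "http://example.org/hypermedia"}  # hypothetical
    for word, target in linkbase.items():
        if re.search(rf"\b{re.escape(word)}\b", doc["text"], re.IGNORECASE):
            doc["links"].append((word, target))
    return doc

def citation_resolver(doc):
    """Add a link for each bracketed citation such as [Carr 95]."""
    for match in re.finditer(r"\[(\w+ \d{2})\]", doc["text"]):
        key = match.group(1).replace(" ", "")
        doc["links"].append((match.group(0), "http://example.org/cite/" + key))
    return doc

def resolve_chain(text, resolvers):
    """Run the resolvers one after another over the same document.

    Each resolver sees the links added by its predecessors, and the
    document is only returned once the last resolver has finished.
    """
    doc = {"text": text, "links": []}
    for resolver in resolvers:
        doc = resolver(doc)
    return doc

result = resolve_chain(
    "Open hypermedia is described in [Carr 95].",
    [keyword_resolver, citation_resolver],
)
```

The total delay before delivery is the sum of the resolvers' processing times, which is why adding a computationally expensive resolver to such a chain is problematic.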
Manchester has developed a Terminology Service (TeS) [Bechhofer 97] that represents a controlled vocabulary using the Description Logic GRAIL, and presents it to external software components through an API. It is used as the conceptual model service of the TourisT conceptual hypermedia system and the STARCH picture retrieval system, and supports linking by example (through query by example), conceptual browsing of documents and automatic classification of documents through their metadata descriptors.
Within the framework of a suitably adapted DLS, a terminology-based query system could be added to the portfolio of link resolvers to provide consistent navigation links based on the concepts contained in the contents or metadata of the multimedia pages that are being browsed. We propose to change the architecture of the DLS, implementing a scalable open hypermedia framework based on the concept of query routing [DeRoure 99], within which many independent, co-operating link resolvers may work in conjunction with the TeS. The result will be a new Conceptual Open Hypermedia Services Environment (COHSE) for effective navigation of large, ill-structured information resources. We will then apply this to the Web using XML and RDF metadata management capabilities. We will demonstrate and evaluate the environment in two areas for which we have terminological models: stock photography and biological publications.
By taking advantage of the Description Logic’s ability to offer automatic and dynamic multi-dimensional classification of concepts (and hence documents) we propose investigating the use of such an augmented open hypermedia link service to improve the consistency of the Web authoring and design process.
The specific research aim is to identify and resolve the issues in deploying COHSE to support large scale Web authoring. Although the approach seems straightforward there are significant research challenges to overcome. The major challenges can be divided into those concerned with:
By taking advantage of a Description Logic’s ability to offer automatic and dynamic multi-dimensional classification of concepts (and hence documents) we aim to improve the consistency of the Web authoring and design process. We propose a WWW system (COHSE: a Conceptual Open Hypermedia Services Environment) that integrates a domain-specific terminology constructed using a Description Logic which supports the dynamic classification of terms, the automatic management of the terminology and the flexible cataloguing of documents to support authoring and browsing users of the WWW.
The COHSE must cater for linking based on more computationally intensive processing, such as a terminology service [Carr 98]. Although very large terminological models have been built for medical terms and molecular biological concepts, neither has been used within a hypermedia browsing context, where there are issues of scalability in thesaurus construction and maintenance, in document cataloguing and classification, and in interacting with the thesaurus or querying and browsing the documents.
To apply Description Logic techniques to the management of metadata to produce a consistent framework for navigating, searching and linking hypermedia resources on the World Wide Web.
Integration of a terminology server as a linking service in an open hypermedia system, allowing links to be followed between documents with the same or similar concept descriptors.
An investigation of the use of content analysis strategies (such as statistical keyword derivation in texts, citation recognition in academic articles and melodic extraction in audio streams) and domain-specific thesauri, ontology or classification schemes already available, to create metadata for as-yet unclassified documents and the application of content analysis tools to the automatic classification of documents and updating of the core ontology. The result will contribute to ontology-building methodologies [FOIS98].
To build a core ontology for the resources provided by the commercial partners using techniques in 1 and leveraged from ontologies developed for (TAMBIS BIF/05344) and (STARCH GR/L71216) and Getty Images’ own classification scheme.
To integrate the various content and metadata management processes within a unified architecture developed in 2, and provide a low-impact WWW user interface.
To investigate the benefit of these techniques for classifying and searching WWW resources, authoring new WWW resources and creating new perspectives on existing resources.
The COHSE system, whose architecture is shown in figures 1 and 2, builds on components that have been proven in other projects (Southampton’s DLS in [Carr 98] and Manchester’s TeS in [Bechhofer 99; Bullock 98]). COHSE consists of the DLRS framework with a set of hypermedia link resolvers including a specialised ‘concept link resolver’ which incorporates the ontology functionality. The main body of the work in this proposal is to discover, store and reason about the metadata associated with collections of documents, and to do so by using TeS-derived software as a module within the enhanced DLRS browsing environment.
Prototype
Initial work will be directed at integrating the existing components, allowing the DLS as a simple document linking service to incorporate the facilities of the TeS, and adding extra infrastructure to the TeS to allow it to function in a WWW environment. A new method for hypertext link presentation must be developed to make the ontology appropriately visible.
DLRS Hypermedia Framework
In the current implementation of the DLS, the various resolvers are either hard-wired as part of the proxy itself or spawned as child sub-processes that pre- or post-process the WWW document as it travels through the proxy. This architecture has been sufficient while the processing model has conformed to that of a WWW proxy, i.e. all processing must be achieved synchronously with the delivery of the document. Any delay is a delay in the critical path of document delivery, hence all processing must be relatively light-weight and tightly coupled. A significant part of this proposal is to design and implement a framework to allow more autonomous processing (DLRS), based on query routing techniques [DeRoure 99], in which heavy-duty services can be invoked in a controlled manner that is decoupled from the delivery of the document. This will allow complex computation, such as that involved in implementing a description logic, to provide added value for document authoring and browsing without impeding the delivery of the core document itself.
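The decoupling of heavy-duty services from document delivery might be sketched as below. This is only an illustration of the principle using a thread pool; it does not implement query routing, and the resolver functions, the simulated delay and the `example.org` targets are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def light_resolver(text):
    """Cheap lookup that is safe to run on the delivery path."""
    return [("DLS", "http://example.org/dls")] if "DLS" in text else []

def heavy_concept_resolver(text):
    """Stand-in for expensive reasoning, e.g. over a description logic."""
    time.sleep(0.05)  # simulated processing cost
    if "ontology" in text:
        return [("ontology", "http://example.org/concepts/ontology")]
    return []

def deliver(text, executor):
    """Return the page immediately, plus a handle on the pending links."""
    pending = executor.submit(heavy_concept_resolver, text)
    page = {"text": text, "links": light_resolver(text)}
    return page, pending

with ThreadPoolExecutor(max_workers=2) as pool:
    page, extra = deliver("The DLS consults an ontology.", pool)
    early_links = list(page["links"])      # available without waiting
    late_links = extra.result(timeout=5)   # collected when ready
```

The page is returned with the light-weight links only; the links from the expensive resolver arrive later and would have to be presented to the user by some separate mechanism, which is one of the presentation issues this proposal sets out to investigate.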
TeS WWW Gateway
Although Southampton’s Mavis project [Lewis 96] is a thesaurus-based system, it is aimed specifically at matching multiple representations of multimedia data fragments. Instead, a Terminology Server and Query Engine which have been used by Manchester University in a number of projects (STARCH, TourisT) will form the core of the Ontology Resolver shown in figure 2. In order to allow interoperation with the WWW, a new client will be created for the Terminology Server which will act as a WWW Gateway. This will enable a user to view the concepts, vocabulary and classification networks that have been built by the server.
Metadata Manager
Any document that is viewed by a user of the system may or may not have some previously defined metadata (including keywords or specific classification information). The Metadata Manager will be responsible for discovering the existing metadata so that it can be passed on to the Terminology Server for further processing. Given a document URL, the Metadata Manager must retrieve any associated RDF data, extract any explicit META tags in the head of an HTML document, and recognise and decode any explicit classification information using well-known classification schemas (such as ICONCLASS [Waal 85] or AAT). If no explicit metadata is available for the document, then content analysis strategies will be invoked, such as searching for explicit author-supplied keywords in the document text, or using keyword recognition software (for example, automatic theming software from Multicosm Ltd). The metadata provided will be used either to position the document within the terminology (as an instance of a particular complex concept), or to extend and update the terminology dynamically (also requiring the reclassification of documents).
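Part of this discovery process can be sketched with Python's standard `html.parser` module. The sketch covers only the extraction of a `keywords` META tag and a deliberately naive word-frequency fallback; retrieval of RDF data and decoding of classification schemes are not shown, and the class name, function name and stop-word list are assumptions.

```python
from collections import Counter
from html.parser import HTMLParser
import re

class MetaTagParser(HTMLParser):
    """Collect name/content pairs from META tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content:
                self.meta[name.lower()] = content

STOPWORDS = {"the", "and", "of", "a", "in", "to", "is"}  # illustrative only

def discover_metadata(html, top_n=3):
    """Prefer explicit keywords; otherwise fall back to frequent words."""
    parser = MetaTagParser()
    parser.feed(html)
    if "keywords" in parser.meta:
        parts = parser.meta["keywords"].split(",")
        return [p.strip().lower() for p in parts if p.strip()]
    # Fallback content analysis: strip tags, then count remaining words.
    text = re.sub(r"<[^>]+>", " ", html).lower()
    words = [w for w in re.findall(r"[a-z]+", text)
             if w not in STOPWORDS and len(w) > 2]
    return [word for word, _ in Counter(words).most_common(top_n)]

tagged = ('<html><head><meta name="keywords" '
          'content="Hypermedia, Ontology"></head><body></body></html>')
untagged = "<p>links links links ontology ontology page</p>"
```

A word count is a crude substitute for the keyword recognition software mentioned above, but it shows where such a component would plug in: only when no explicit metadata is found.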
Terminology Manager
The Terminology Manager has the role of driving updates for the ontology as part of the normal browsing process. As the user views documents, the URLs and derived metadata associated with the viewed documents are passed to the Terminology Manager. If the keywords do not map onto concepts existing in the ontology which are already stored in the Terminology Server, the manager is responsible for creating new concepts.
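The update behaviour described above could look roughly like this. The sketch stores concepts in a flat dictionary, whereas in COHSE a new concept would be classified by the Terminology Server; the class, method and attribute names are hypothetical.

```python
class TerminologyManager:
    """Toy stand-in: maps concept names to the documents describing them."""

    def __init__(self, known_concepts):
        self.concepts = {name.lower(): set() for name in known_concepts}
        self.created = []  # concepts added during browsing, in order

    def record_view(self, url, keywords):
        """Attach a viewed document to its concepts, creating any missing."""
        for keyword in keywords:
            key = keyword.lower()
            if key not in self.concepts:
                # No existing concept matches, so a new one is created.
                self.concepts[key] = set()
                self.created.append(key)
            self.concepts[key].add(url)

manager = TerminologyManager(["photograph"])
manager.record_view("http://example.org/p1", ["Photograph", "Portrait"])
```

After the call, `manager.created` holds only `"portrait"`, since `"photograph"` was already known; both concepts now reference the viewed document.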
Ontology Construction & Refinement
Specific collections of documents have been chosen for case studies (provided by the commercial partners). Documents from a museum collection already have classification metadata associated with them; those from an academic publisher will have classification applied to them. For both collections, a core ontology will be defined, and the Terminology Manager will be used for automatic maintenance and update.
Evaluation
The use of the Terminology Manager described above will be compared against creating and maintaining the full ontology by hand. The navigation facilities that COHSE provides for the user will be compared with standard (static) hierarchical organisation and search facilities. Evaluation of this kind of project is difficult; the criteria for success are qualitative rather than quantitative. Measures used in IR systems, such as precision and recall, are inappropriate for archive search tasks [Bechhofer 99]. An appropriate approach is through our case study applications and a series of search and navigation scenarios, such as those developed in [Bullock 99; Bates 98; Garber 92] and by our collaborators. Evaluation is thus in two parts: a) the soundness of the architecture using traditional metrics and b) the satisfaction of the link and search scenarios by the case study hypermedia.
Aitchison, J. and Gilchrist, A. Thesaurus construction – A practical manual, 2nd ed: Aslib. London, 1987.
Barrett, R. and Maglio, P.P. (1998) Intermediaries: new places for producing and manipulating web content, In Proceedings of Seventh WWW Conference, Brisbane.
Bates, M. (1998) Indexing and Access for Digital Libraries and the Internet: Human, Database and Domain Factors, Journal of the American Society for Information Science 49(13), 1185-1205.
Bechhofer, S. and Goble, C. (1999). Classification Based Navigation for Picture Archives. To appear in Proceedings of IFIP WG2.6 Conference on Data Semantics, DS8, New Zealand, Kluwer.
Bechhofer, S., Goble, C.A., Rector, A.L., Solomon, W.D. (1997) Terminologies and Terminology Servers for Information Environments. Proc. IEEE Conf on Software Technology Experience & Practice, 484-497.
Beitner, N., Hall, W., Goble, C.A. Putting the media into hypermedia. Proc. SPIE Multimedia Networking (1995).
Bruza, P.D. (1990) Hyperindices: a novel aid for searching in hypermedia, In Proceedings of the 1990 ACM Hypertext, 109-122.
Bruza P.D. The modelling and retrieval of documents using index expressions. SIGIR Forum 25, 2 (1991), 91-103.
Bullock, J. (1999) Informed Navigation: Description Logic Based Hypermedia Linking. PhD Thesis, University of Manchester.
Bullock, J., and Goble, C. (1998). TourisT: The Application of a Description Logic based Semantic Hypermedia System for Tourism, In Proceedings of the Ninth ACM Hypertext Conference, Pittsburgh. ISBN 0-89791-972-6
Carr, L., De Roure, D., Hall, W., Hill, G., (1995) The Distributed Link Service: A Tool for Publishers, Authors and Readers, World Wide Web Journal 1(1), 647-656, O'Reilly & Associates.
Carr, L., Davis, H., De Roure, D., Hall, W., Hill, G. (1996) Open Information Services, Computer Networks and ISDN Systems, 28 (7/11), 1027-1036, Elsevier.
Carr L., Hall, W., Hitchcock, S., (1998) Link Services or Link Agents?, In Proceedings of the Ninth ACM Hypertext Conference. pp 113-122.
Coulter, A. (1998) Computing Classification System 1998: Current Status and Future Maintenance. Report of the CCS Update Committee, Computing Reviews, Jan 1998, 1-5.
DeRoure, D., L. Carr, W. Hall and G. Hill (1996) A Distributed Hypermedia Link Service, In Proceedings SDNE96, IEEE Computer Society Press.
DeRoure, D., El-Beltagy, S., Gibbins, N., Carr, L. and Hall, W. (1999) Integrating Link Resolution Services using Query Routing, In Proceedings of the Fifth Open Hypermedia Workshop (in press).
Ellis, D., Furner, J., Willett, P. On the creation of hypertext links in full-text documents - measurement of retrieval effectiveness. Journal of the American Society For Information Science, 47(4) 1996, 287-300.
Fellbaum C. (ed.) (1998) WordNet: An Electronic Lexical Database, MIT Press, ISBN 0-262-06197-X
Garber, S.R., Grunes, M.B. (1992) The Art of Search: A Study of Art Directors, In Proceedings of CHI’92, 157-163.
Guarino, N. (ed.) (1998) Formal Ontology in Information Systems, IOS Press, June 1998.
Hall, W (1994) Ending the Tyranny of the Button, IEEE Multimedia 1,1, 60-68
Hitchcock, S., Quek, F., Carr, L., Hall, W., Witbrock, A. and Tarr, I. (1998) Towards Universal Linking for Electronic Journals, Serials Review, 24(1) (Spring 1998), 21-33.
Ingwersen, P. (1996) Cognitive perspectives of information-retrieval interaction - elements of a cognitive IR theory. Journal Of Documentation 52, 1, 3-50.
Lancaster, F.W. (1986) Vocabulary Control for Information Retrieval: Information Resources Press. Arlington, Virginia, 1986.
Lassila, O., and Swick, R. (eds) (1998) Resource Description Framework (RDF): Model and Syntax Specification. W3C Proposed Recommendation (January 1999). http://www.w3.org/TR/PR-rdf-syntax/
Lewis, P., Davis, H., Dobie, M. and Hall, W. (1996) Towards Multimedia Thesaurus Support for Media-based Navigation, In First International Workshop on Image Databases and Multimedia Search.
Lowe, D. and Hall, W. (1998) Hypertext and the Web: An Engineering Approach, John Wiley & Sons.
Maler, E., DeRose, S. (eds.) (1998) "XML Linking Language (XLink)", World Wide Web Consortium Working Draft, 3-March-1998, http://www.w3.org/TR/1998/WD-xlink-19980303
Nanard J. & Nanard, M. (1991) Using structured types to incorporate knowledge in hypertext, In Proceedings of the 1991 ACM Hypertext Conference, 329-342.
Osterbye, K. and Wiil, U. (1996) The Flag Taxonomy of Open Hypermedia Systems, In Proceedings of the 1996 ACM Hypertext Conference, 129-139.
Peterson, T. (1994). Introduction to the Art & Architecture Thesaurus, OUP.
Thiele, H. (1998). "The Dublin Core and Warwick Framework: A review of the literature, March 1995 - September 1997." D-Lib, January 1998.
Thistlethwaite, P. (1997) Automatic Construction and Management of Large Open Webs, Information Processing and Management, 33(2), 161-173.
Tudhope, D., Taylor, C. (1997) Navigation via similarity: automatic linking based on semantic closeness. Information Processing & Management 33(2), 233-242.
Waal, H.v.d. (1985) ICONCLASS: An Iconographic Classification System: Koninklijke Nederlandse Akademie van Wetenschappen.
Work Plan
| Phase | Location | Work Package | Deliverable |
|---|---|---|---|
| A (3 months) Prototype | Southampton | a. Prototype COHSE using existing DLS and TeS software and WWW gateway | Prototype platform |
| | Manchester & Southampton | b. Prepare Hulton-Getty archive for conformance to WWW standards (RDF). | |
| | Manchester | c. Start to build initial terminology investigating the use of existing techniques applied automatically. | |
| B (6 months) Automatic updating issues | Manchester | a. Investigate the incremental thesaurus update issues | Report, methodology & demonstration |
| | Southampton | b. Investigate the hypermedia link presentation issues (visibility, credence, integration) | |
| | Manchester | c. Investigate the catalogue updating issues | |
| C (6 months) New Thesaurus | Manchester | a. Automatic (and semi-automatic) bottom up thesaurus generation techniques | Report, policies & demonstration |
| | Southampton | b. Investigate automatic keyword techniques, initially working with Themes software from Multicosm Ltd. | |
| | Southampton & Manchester | c. Case study: Automatic classification and keywording of documents based on Company of Biologists digital archive. | |
| | Southampton | d. Construct metadata manager | |
| D (12 months) Building the Software | Southampton | a. Build basic DLRS (figure 1), comprising WWW proxy with asynchronous processing of document data. | Final software |
| | Manchester | b. Re-engineer Terminology Server to produce Concept Resolver (figure 2) for DLRS. | |
| | Southampton | c. Systems integration and hypermedia link presentation interface for final COHSE. | |
| E (6 months) Evaluating Final Environment | Manchester & Southampton | a. Case Study: Use of COHSE for WWW authoring based on resources from Hulton-Getty archive. | Web site & Report |
Workplan Overview
[Chart: schedule of the project phases across Year 1 and Year 2]
Supporting Figures