Building a public digital library based on full-text retrieval

Witten, Ian H.; Nevill-Manning, Craig G.; Cunningham, Sally Jo

Abstract

Digital libraries are expensive to create and maintain, and generally restricted to a particular corporation or group of paying subscribers. While many indexes to the World Wide Web are freely available, the quality of what is indexed is extremely uneven. The digital analog of a public library (a reliable, quality, community service) has yet to appear. This paper demonstrates the feasibility of a cost-effective collection of high-quality public-domain information, available free over the Internet. One obstacle to the creation of a digital library is the difficulty of providing formal cataloguing information. Without a title, author and subject database it seems hard to offer the searching facilities normally available in physical libraries. Full-text retrieval provides a way of approximating these services without a concomitant investment of resources. A second obstacle is the problem of finding a suitable corpus of material. Computer science research reports form the focus of our prototype implementation. These constitute a large body of high-quality public-domain documents. Given such a corpus, a third issue becomes the question of obtaining both plain text for indexing and page images for readability. Typesetting formats such as PostScript provide some of the benefits of libraries scanned from paper documents, such as page-based indexing and viewing, without the physical demands and error-prone nature of scanning and optical character recognition. However, until recently the difficulty of extracting text from PostScript seems to have encouraged indexing on plain-text abstracts or bibliographic information provided by authors. We have developed a new technique that overcomes the problem. This paper describes the architecture, the indexing, collection and maintenance processes, and the retrieval interface of a prototype public digital library.

Type

Working Paper

Series

Computer Science Working Papers

Citation

Witten, I. H., Nevill-Manning, C. G. & Cunningham, S. J. (1995) Building a public digital library based on full-text retrieval. (Working paper 95/24). Hamilton, New Zealand: University of Waikato, Department of Computer Science.

Date

1995-08

Publisher

University of Waikato, Department of Computer Science
