Witten, I. H., Nevill-Manning, C. G. & Cunningham, S. J. (1995) Building a public digital library based on full-text retrieval. (Working paper 95/24). Hamilton, New Zealand: University of Waikato, Department of Computer Science.
Permanent Research Commons link: http://hdl.handle.net/10289/1101
Digital libraries are expensive to create and maintain, and are generally restricted to a particular corporation or group of paying subscribers. While many indexes to the World Wide Web are freely available, the quality of what is indexed is extremely uneven. The digital analog of a public library (a reliable, quality, community service) has yet to appear. This paper demonstrates the feasibility of a cost-effective collection of high-quality public-domain information, available free over the Internet. One obstacle to the creation of a digital library is the difficulty of providing formal cataloguing information. Without a title, author and subject database it seems hard to offer the searching facilities normally available in physical libraries. Full-text retrieval provides a way of approximating these services without a concomitant investment of resources. A second obstacle is the problem of finding a suitable corpus of material. Computer science research reports form the focus of our prototype implementation. These constitute a large body of high-quality public-domain documents. Given such a corpus, a third issue becomes the question of obtaining both plain text for indexing and page images for readability. Typesetting formats such as PostScript provide some of the benefits of libraries scanned from paper documents, such as page-based indexing and viewing, without the physical demands and error-prone nature of scanning and optical character recognition. However, until recently the difficulty of extracting text from PostScript seems to have encouraged indexing on plain-text abstracts or bibliographic information provided by authors. We have developed a new technique that overcomes the problem. This paper describes the architecture, the indexing, collection and maintenance processes, and the retrieval interface of a prototype public digital library.