
Extracting corpus specific knowledge bases from Wikipedia

Abstract
Thesauri are useful knowledge structures for assisting information retrieval. Yet their production is labor-intensive, and few domains have comprehensive thesauri that cover domain-specific concepts and contemporary usage. One approach, which has been attempted without much success for decades, is to seek statistical natural language processing algorithms that work on free text. Instead, we propose to replace costly professional indexers with thousands of dedicated amateur volunteers--namely, those who are producing Wikipedia. This vast, open encyclopedia represents a rich tapestry of topics and semantics and a huge investment of human effort and judgment. We show how this can be directly exploited to provide WikiSauri: manually defined yet inexpensive thesaurus structures that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We also offer concrete evidence of the effectiveness of WikiSauri for assisting information retrieval.
Type
Working Paper
Series
Computer Science Working Papers
Citation
Milne, D., Witten, I.H. & Nichols, D.M. (2007). Extracting corpus specific knowledge bases from Wikipedia. (Working paper series. University of Waikato, Department of Computer Science. No. 03/2007). Hamilton, New Zealand: University of Waikato.
Date
2007-06-01
Publisher
University of Waikato, Department of Computer Science