Extracting corpus specific knowledge bases from Wikipedia

Milne, David N.; Witten, Ian H.; Nichols, David M.

dc.contributor.author	Milne, David N.	en_US
dc.contributor.author	Witten, Ian H.	en_US
dc.contributor.author	Nichols, David M.	en_US
dc.date.accessioned	2008-03-19T04:58:28Z
dc.date.available	2007-07-02	en_US
dc.date.available	2008-03-19T04:58:28Z
dc.date.issued	2007-06-01	en_US
dc.description.abstract	Thesauri are useful knowledge structures for assisting information retrieval. Yet their production is labor-intensive, and few domains have comprehensive thesauri that cover domain-specific concepts and contemporary usage. One approach, which has been attempted without much success for decades, is to seek statistical natural language processing algorithms that work on free text. Instead, we propose to replace costly professional indexers with thousands of dedicated amateur volunteers, namely those who are producing Wikipedia. This vast, open encyclopedia represents a rich tapestry of topics and semantics and a huge investment of human effort and judgment. We show how this can be directly exploited to provide WikiSauri: manually-defined yet inexpensive thesaurus structures that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We also offer concrete evidence of the effectiveness of WikiSauri for assisting information retrieval.	en_US
dc.format.mimetype	application/pdf
dc.identifier.citation	Milne, D., Witten, I.H. & Nichols, D.M. (2007). Extracting corpus specific knowledge bases from Wikipedia. (Working paper series. University of Waikato, Department of Computer Science. No. 03/2007). Hamilton, New Zealand: University of Waikato.	en_US
dc.identifier.uri	https://hdl.handle.net/10289/69
dc.language.iso	en
dc.publisher	University of Waikato, Department of Computer Science
dc.relation.ispartofseries	Computer Science Working Papers
dc.subject	Wikipedia	en_US
dc.subject	data mining	en_US
dc.subject	thesauri	en_US
dc.subject	disambiguation	en_US
dc.subject	semantic relatedness	en_US
dc.subject	query expansion	en_US
dc.subject	Machine learning
dc.title	Extracting corpus specific knowledge bases from Wikipedia	en_US
dc.type	Working Paper	en_US
pubs.elements-id	53449
uow.relation.series	03/2007

Files

Original bundle

Name: content.pdf
Size: 93.08 KB
Format: Adobe Portable Document Format

Collections

2007 Working Papers