Extracting corpus specific knowledge bases from Wikipedia

dc.contributor.author: Milne, David N. [en_US]
dc.contributor.author: Witten, Ian H. [en_US]
dc.contributor.author: Nichols, David M. [en_US]
dc.date.accessioned: 2008-03-19T04:58:28Z
dc.date.available: 2007-07-02 [en_US]
dc.date.available: 2008-03-19T04:58:28Z
dc.date.issued: 2007-06-01 [en_US]
dc.description.abstract: Thesauri are useful knowledge structures for assisting information retrieval. Yet their production is labor-intensive, and few domains have comprehensive thesauri that cover domain-specific concepts and contemporary usage. One approach, which has been attempted without much success for decades, is to seek statistical natural language processing algorithms that work on free text. Instead, we propose to replace costly professional indexers with thousands of dedicated amateur volunteers--namely, those that are producing Wikipedia. This vast, open encyclopedia represents a rich tapestry of topics and semantics and a huge investment of human effort and judgment. We show how this can be directly exploited to provide WikiSauri: manually-defined yet inexpensive thesaurus structures that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We also offer concrete evidence of the effectiveness of WikiSauri for assisting information retrieval. [en_US]
dc.format.mimetype: application/pdf
dc.identifier.citation: Milne, D., Witten, I.H. & Nichols, D.M. (2007). Extracting corpus specific knowledge bases from Wikipedia. (Working paper series. University of Waikato, Department of Computer Science. No. 03/2007). Hamilton, New Zealand: University of Waikato. [en_US]
dc.identifier.uri: https://hdl.handle.net/10289/69
dc.language.iso: en
dc.publisher: University of Waikato, Department of Computer Science
dc.relation.ispartofseries: Computer Science Working Papers
dc.subject: Wikipedia [en_US]
dc.subject: data mining [en_US]
dc.subject: thesauri [en_US]
dc.subject: disambiguation [en_US]
dc.subject: semantic relatedness [en_US]
dc.subject: query expansion [en_US]
dc.subject: Machine learning
dc.title: Extracting corpus specific knowledge bases from Wikipedia [en_US]
dc.type: Working Paper [en_US]
pubs.elements-id: 53449
uow.relation.series: 03/2007
Files
Original bundle
Name: content.pdf
Size: 93.08 KB
Format: Adobe Portable Document Format