Semantic document representation: Do It with Wikification
Witten, I. (2012). Semantic document representation: Do It with Wikification. In Lecture Notes in Computer Science, 2012, Volume 7608, String Processing and Information Retrieval: 19th International Symposium, SPIRE 2012, p. 17.
Permanent Research Commons link: http://hdl.handle.net/10289/6933
Wikipedia is a goldmine of information. Each article describes a single concept, and together they constitute a vast investment of manual effort and judgment. Wikification is the process of automatically augmenting a plain-text document with hyperlinks to Wikipedia articles. This involves associating phrases in the document with concepts, disambiguating them, and selecting the most pertinent ones. All three processes can be addressed by exploiting Wikipedia as a source of data. For the first, link anchor text illustrates how concepts are described in running text. For the second and third, Wikipedia provides millions of examples that can be used to prime machine-learned algorithms for disambiguation and selection respectively. Wikification produces a semantic representation of any document in terms of concepts. We apply this to (a) select index terms for scientific documents, and (b) determine the similarity of two documents, in both cases outperforming humans in terms of agreement with human judgment. I will show how it can be applied to document clustering and classification algorithms, and to produce back-of-the-book indexes, improving on the state of the art in each case.
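
The three steps described above (associating phrases with concepts, disambiguating them, and selecting the most pertinent) can be illustrated with a minimal Python sketch. It is not the method presented in the talk: the abstract refers to machine-learned disambiguation and selection, whereas this sketch substitutes two simple counting heuristics (most frequent link target, and the fraction of a phrase's occurrences that are links). The tables ANCHORS and OCCURRENCES, their counts, the function names, and the 0.1 threshold are all hypothetical stand-ins for statistics that would be mined from Wikipedia. The final function shows one simple way to compare two documents by their concepts (Jaccard overlap), again as an assumption rather than the measure used in the work.

```python
# Hypothetical toy statistics standing in for data mined from Wikipedia.
# ANCHORS[phrase][concept] = number of times the phrase is used as link
# anchor text pointing at that concept (article title).
ANCHORS = {
    "jaguar": {"Jaguar (animal)": 60, "Jaguar Cars": 40},
    "rainforest": {"Rainforest": 90},
    "the": {"The (article)": 1},
}
# OCCURRENCES[phrase] = number of times the phrase appears in article
# text at all, whether linked or not.
OCCURRENCES = {"jaguar": 200, "rainforest": 150, "the": 100000}


def candidate_phrases(text):
    """Step 1: keep the tokens that are known anchor phrases."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [t for t in tokens if t in ANCHORS]


def disambiguate(phrase):
    """Step 2 (heuristic): pick the most frequent link target."""
    senses = ANCHORS[phrase]
    return max(senses, key=senses.get)


def link_probability(phrase):
    """Fraction of the phrase's occurrences that are links."""
    return sum(ANCHORS[phrase].values()) / OCCURRENCES[phrase]


def wikify(text, threshold=0.1):
    """Step 3 (heuristic): keep concepts whose phrase is linked often."""
    concepts = []
    for phrase in candidate_phrases(text):
        if link_probability(phrase) >= threshold:
            concept = disambiguate(phrase)
            if concept not in concepts:
                concepts.append(concept)
    return concepts


def similarity(doc_a, doc_b):
    """Jaccard overlap between the concept sets of two documents."""
    a, b = set(wikify(doc_a)), set(wikify(doc_b))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(wikify("The jaguar lives in the rainforest."))
print(similarity("A jaguar.", "The rainforest and the jaguar"))
```

In this toy data the word "the" is almost never a link, so it is discarded at the selection step, while "jaguar" resolves to its more frequently linked sense. A learned disambiguator would instead use the surrounding context to choose between senses.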