Huang, AnnaWitten, Ian H.Frank, EibeMilne, David N.2010-02-102010-02-102009Huang, A., Witten, I. H., Frank, E. & Milne, D. (2009). Clustering documents using a Wikipedia-based concept representation. In Proceedings of 13th Pacific-Asia Conference, PAKDD 2009 Bangkok, Thailand, April 27-30, 2009. (pp. 628-636).https://hdl.handle.net/10289/3559This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We also developed a similarity measure that evaluates the semantic relatedness between concept sets for two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that although further optimizations could be performed, our approach already improves upon related techniques.application/pdfenThis is an author’s accepted version of an article published in Proceedings of 13th Pacific-Asia Conference, PAKDD 2009 Bangkok, Thailand, April 27-29. ©2009 Springer-Verlag Berlin Heidelberg.Machine learningClustering documents using a Wikipedia-based concept representation10.1007/978-3-642-01307-2_62