Clustering documents using a Wikipedia-based concept representation

Rights

This is an author’s accepted version of an article published in the Proceedings of the 13th Pacific-Asia Conference, PAKDD 2009, Bangkok, Thailand, April 27-30, 2009. ©2009 Springer-Verlag Berlin Heidelberg.

Abstract

This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We then develop a similarity measure that evaluates the semantic relatedness between the concept sets of two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that, although further optimizations could be performed, our approach already improves upon related techniques.
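The two steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the phrase-to-concept lookup `PHRASE_TO_CONCEPT`, the pairwise scores in `RELATEDNESS`, and the averaged best-match similarity are all hypothetical stand-ins, chosen only to show how documents might be mapped to Wikipedia concepts and how two concept sets might then be compared.

```python
import re

# Hypothetical toy lookup from surface phrases to Wikipedia article titles.
# A real system would derive this from Wikipedia itself.
PHRASE_TO_CONCEPT = {
    "machine learning": "Machine learning",
    "neural network": "Artificial neural network",
    "clustering": "Cluster analysis",
    "k-means": "K-means clustering",
}

# Hypothetical pairwise relatedness scores in [0, 1]; unlisted pairs score 0.
RELATEDNESS = {
    frozenset({"Cluster analysis", "K-means clustering"}): 0.9,
    frozenset({"Machine learning", "Artificial neural network"}): 0.8,
    frozenset({"Machine learning", "Cluster analysis"}): 0.6,
}


def to_concepts(text, max_words=3):
    """Map every known phrase (up to max_words long) to its concept."""
    words = re.findall(r"[a-z0-9\-]+", text.lower())
    concepts = set()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in PHRASE_TO_CONCEPT:
                concepts.add(PHRASE_TO_CONCEPT[phrase])
    return concepts


def relatedness(a, b):
    """Look up the assumed relatedness of two concepts."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def concept_set_similarity(set_a, set_b):
    """Symmetric average of each concept's best match in the other set.

    This is one simple way to compare concept sets, used here as an
    assumption; the paper's own measure may differ.
    """
    if not set_a or not set_b:
        return 0.0

    def directed(src, dst):
        return sum(max(relatedness(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (directed(set_a, set_b) + directed(set_b, set_a))


doc_a = to_concepts("Clustering with k-means")
doc_b = to_concepts("Machine learning and clustering")
similarity = concept_set_similarity(doc_a, doc_b)
```

A similarity of this kind could be passed to any standard clustering algorithm in place of a term-overlap measure such as cosine similarity over word counts.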

Citation

Huang, A., Witten, I. H., Frank, E. & Milne, D. (2009). Clustering documents using a Wikipedia-based concept representation. In Proceedings of the 13th Pacific-Asia Conference, PAKDD 2009, Bangkok, Thailand, April 27-30, 2009 (pp. 628-636).

Publisher

Springer
