Clustering documents with active learning using Wikipedia

Huang, Anna; Witten, Ian H.; Frank, Eibe; Milne, David N.

doi:10.1109/ICDM.2008.80

Files

08-AH-IHW-EF-DM-Activeclustering.pdf (203.49 KB)

Permanent Link

https://hdl.handle.net/10289/3557

DOI

10.1109/ICDM.2008.80

Rights

©2009 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Abstract

Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when only limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated with a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering in the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields performance comparable to previous attempts, and that adding constraints improves clustering performance further by up to 20%.
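
The idea of pair-wise instance-level constraints guiding a clustering can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes documents are already represented as sparse concept-weight vectors, uses a greedy, COP-KMeans-style assignment step as a stand-in technique, and leaves out how constraints are obtained through active learning. All names (`cosine`, `violates`, `assign`) and the toy data are hypothetical.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse concept-weight vectors (dicts)."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def violates(doc, cluster, assignment, must_link, cannot_link):
    """True if placing `doc` in `cluster` breaks a constraint with an
    already-assigned document."""
    for a, b in must_link:
        other = b if a == doc else a if b == doc else None
        if other is not None and assignment.get(other, cluster) != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == doc else a if b == doc else None
        if other is not None and assignment.get(other) == cluster:
            return True
    return False

def assign(docs, centroids, must_link=(), cannot_link=()):
    """Greedily give each document the most similar centroid that keeps
    every constraint satisfied (assumed stand-in, not the paper's algorithm)."""
    assignment = {}
    for name, vec in docs.items():
        ranked = sorted(range(len(centroids)),
                        key=lambda k: cosine(vec, centroids[k]), reverse=True)
        for k in ranked:
            if not violates(name, k, assignment, must_link, cannot_link):
                assignment[name] = k
                break
        else:
            raise ValueError(f"no feasible cluster for {name!r}")
    return assignment

# Toy concept-based documents; the concept names are illustrative only.
docs = {
    "d1": {"Jaguar (animal)": 1.0, "Predation": 0.5},
    "d2": {"Jaguar Cars": 1.0, "Automobile": 0.8},
    "d3": {"Jaguar (animal)": 0.2, "Automobile": 0.3},
}
centroids = [
    {"Jaguar (animal)": 1.0, "Predation": 1.0},
    {"Jaguar Cars": 1.0, "Automobile": 1.0},
]

unconstrained = assign(docs, centroids)
constrained = assign(docs, centroids, must_link=[("d1", "d3")])
```

In this toy setting, `d3` is closer to the second centroid on similarity alone, but the must-link constraint with `d1` moves it to the first cluster. That is the sense in which constraints steer the clustering.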

Citation

Huang, A., Witten, I. H., Frank, E. & Milne, D. (2009). Clustering documents with active learning using Wikipedia. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, December 15-19, 2008 (pp. 839-844). Washington, DC, USA: IEEE Computer Society.

Type

Conference Contribution

Date

2009

Publisher

IEEE Computer Society
