Research Commons
      • Browse 
        • Communities & Collections
        • Titles
        • Authors
        • By Issue Date
        • Subjects
        • Types
        • Series
      • Help 
        • About
        • Collection Policy
        • OA Mandate Guidelines
        • Guidelines FAQ
        • Contact Us
      • My Account 
        • Sign In
        • Register
      View Item 
      •   Research Commons
      • University of Waikato Research
      • Computing and Mathematical Sciences
      • Computing and Mathematical Sciences Papers
      • View Item
      •   Research Commons
      • University of Waikato Research
      • Computing and Mathematical Sciences
      • Computing and Mathematical Sciences Papers
      • View Item
      JavaScript is disabled for your browser. Some features of this site may not work without it.

      Clustering documents with active learning using Wikipedia

      Huang, Anna; Witten, Ian H.; Frank, Eibe; Milne, David N.
      Thumbnail
      Files
      08-AH-IHW-EF-DM-Activeclustering.pdf
      203.4Kb
      DOI
       10.1109/ICDM.2008.80
      Find in your library  
      Citation
      Export citation
      Huang, A., Witten, I. H., Frank, E. & Milne, D. (2009). Clustering documents with active learning using Wikipedia. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, December 15-19, 2009. (pp. 839-844). Washington, DC, USA: IEEE Computer Society.
      Permanent Research Commons link: https://hdl.handle.net/10289/3557
      Abstract
      Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated to a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering towards the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields comparable performance to previous attempts; and adding constraints improves clustering performance further by up to 20%.
      Date
      2009
      Type
      Conference Contribution
      Publisher
      IEEE Computer Society
      Rights
      ©2009 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
      Collections
      • Computing and Mathematical Sciences Papers [1455]
      Show full item record  

      Usage

      Downloads, last 12 months
      56
       
       
       

      Usage Statistics

      For this itemFor all of Research Commons

      The University of Waikato - Te Whare Wānanga o WaikatoFeedback and RequestsCopyright and Legal Statement