Research Commons

      Learning a concept-based document similarity measure

      Huang, Lan; Milne, David N.; Frank, Eibe; Witten, Ian H.
      DOI
       10.1002/asi.22689
      Link
       onlinelibrary.wiley.com
      Citation
      Huang, L., Milne, D.N., Frank, E. & Witten, I.H. (2012). Learning a concept-based document similarity measure. Journal of the American Society for Information Science and Technology, 63(8), 1593-1608.
      Permanent Research Commons link: https://hdl.handle.net/10289/6561
      Abstract
      Document similarity measures are crucial components of many text-analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: They estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine-learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people's judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.
      Date
      2012
      Type
      Journal Article
      Publisher
      Wiley
      Rights
      This is a preprint of an article accepted for publication in Journal of the American Society for Information Science and Technology. © 2012 (American Society for Information Science and Technology)
      Collections
      • Computing and Mathematical Sciences Papers [1455]

      The University of Waikato - Te Whare Wānanga o Waikato