      Text categorization and similarity analysis: similarity measure, architecture and design

      Fowke, Michael; Hinze, Annika; Heese, Ralf
      Files
      uow-cs-wp-2013-12.pdf
      504.8Kb
      Citation
      Fowke, M., Hinze, A., & Heese, R. (2013). Text categorization and similarity analysis: similarity measure, architecture and design (Working paper 12/2013). Hamilton, New Zealand: University of Waikato, Department of Computer Science.
      Permanent Research Commons link: https://hdl.handle.net/10289/8433
      Abstract
      This research investigates which similarity measure is most appropriate for a document classification problem. The goal is to find a method that accurately identifies both semantically related documents and documents that are versions of one another. The method must also be efficient in both speed and disk usage. Simhash is found to be the measure best suited to the application, and it can be combined with other software to increase accuracy. Pingar have provided an API that extracts the entities from a document and creates a taxonomy displaying the relationships between them; this extra information can be used to classify input documents accurately. Two algorithms incorporating the Pingar API are designed, and finally an efficient comparison algorithm is introduced to reduce the number of comparisons required.
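For readers unfamiliar with Simhash, the following is a minimal sketch of the general technique, not the implementation described in the working paper. The tokenizer, the use of MD5 as the per-token hash, the 64-bit fingerprint width, and the function names `simhash` and `hamming_distance` are all assumptions made for illustration. Each token votes on every bit of the fingerprint, weighted by its frequency, and two documents are compared by the Hamming distance between their fingerprints.

```python
import hashlib
import re
from collections import Counter

FINGERPRINT_BITS = 64  # assumed width; the paper's choice may differ


def simhash(text: str) -> int:
    """Return a 64-bit Simhash fingerprint of `text` (illustrative sketch)."""
    # Assumption: lowercase word tokens, weighted by term frequency.
    weights = Counter(re.findall(r"\w+", text.lower()))
    totals = [0] * FINGERPRINT_BITS
    for token, weight in weights.items():
        # Assumption: the first 8 bytes of MD5 serve as the token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(FINGERPRINT_BITS):
            # A set bit votes for 1, a clear bit votes for 0.
            if (token_hash >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit in range(FINGERPRINT_BITS):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Documents that share most of their tokens tend to produce fingerprints with a small Hamming distance, which is why the measure suits detecting versions of the same document. Storing one fixed-size integer per document also keeps disk usage low, in line with the efficiency requirement stated in the abstract.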
      Date
      2013-12
      Type
      Working Paper
      Series
      Computer Science Working Papers
      Report No.
      12/2013
      Publisher
      University of Waikato, Department of Computer Science
      Rights
      © 2013 Michael Fowke, Annika Hinze, Ralf Heese.
      Collections
      • 2013 Working Papers [13]