Text categorization and similarity analysis: similarity measure, architecture and design

Fowke, Michael; Hinze, Annika; Heese, Ralf

Authors

Fowke, Michael

Hinze, Annika

Heese, Ralf

Files

uow-cs-wp-2013-12.pdf (504.83 KB)

Permanent Link

https://hdl.handle.net/10289/8433

Abstract

This research investigates which similarity measure is most appropriate for a document classification problem. The goal is a method that accurately finds both semantically related and version-related documents, while remaining efficient in speed and disk usage. Simhash is found to be the measure best suited to the application, and it can be combined with other software to increase accuracy. Pingar have provided an API that extracts the entities from a document and creates a taxonomy displaying the relationships; this extra information can be used to classify input documents accurately. Two algorithms incorporating the Pingar API are designed, and finally an efficient comparison algorithm is introduced to cut down the number of comparisons required.
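
For readers unfamiliar with the measure named in the abstract, the following is a minimal sketch of a generic Simhash fingerprint and the Hamming-distance comparison usually paired with it. It is not the implementation described in the working paper: the function names (`tokens`, `simhash`, `hamming`), the 64-bit fingerprint width, the use of MD5 to hash tokens, and the unweighted word tokens are all assumptions made for illustration.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width; the paper's choice is not stated here


def tokens(text):
    """Split text into lowercase alphanumeric word tokens (illustrative tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text):
    """Return a 64-bit Simhash fingerprint of the text.

    Each token is hashed to 64 bits; every bit position votes +1 if the
    token's bit is set and -1 otherwise. The fingerprint keeps the bits
    whose total vote is positive.
    """
    votes = [0] * BITS
    for tok in tokens(text):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 8 bytes = 64 bits
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Under these assumptions, documents that share most of their tokens tend to produce fingerprints with a small Hamming distance, so a distance threshold can serve as a cheap similarity test. Storing one fixed-width integer per document is what keeps the disk footprint low.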

Citation

Fowke, M., Hinze, A., & Heese, R. (2013). Text categorization and similarity analysis: similarity measure, architecture and design. (Working paper 12/2013). Hamilton, New Zealand: University of Waikato, Department of Computer Science.

Type

Working Paper

Series name

Computer Science Working Papers

Date

2013-12

Publisher

University of Waikato, Department of Computer Science
