Text categorization and similarity analysis: similarity measure, architecture and design

dc.contributor.author: Fowke, Michael
dc.contributor.author: Hinze, Annika
dc.contributor.author: Heese, Ralf
dc.date.accessioned: 2014-01-29T03:52:39Z
dc.date.available: 2014-01-29T03:52:39Z
dc.date.issued: 2013-12
dc.description.abstract: This research looks at the most appropriate similarity measure to use for a document classification problem. The goal is to find a method that is accurate in finding both semantically related and version-related documents. A necessary requirement is that the method is efficient in its speed and disk usage. Simhash is found to be the measure best suited to the application, and it can be combined with other software to increase the accuracy. Pingar have provided an API that extracts the entities from a document and creates a taxonomy displaying their relationships; this extra information can be used to accurately classify input documents. Two algorithms incorporating the Pingar API are designed, and finally an efficient comparison algorithm is introduced to reduce the number of comparisons required. [en_NZ]
dc.format.mimetype: application/pdf
dc.identifier.citation: Fowke, M., Hinze, A., & Heese, R. (2013). Text categorization and similarity analysis: similarity measure, architecture and design. (Working paper 12/2013). Hamilton, New Zealand: University of Waikato, Department of Computer Science. [en_NZ]
dc.identifier.issn: 1177-777X
dc.identifier.uri: https://hdl.handle.net/10289/8433
dc.language.iso: en [en_NZ]
dc.publisher: University of Waikato, Department of Computer Science [en_NZ]
dc.relation.ispartofseries: Computer Science Working Papers [en_NZ]
dc.rights: © 2013 Michael Fowke, Annika Hinze, Ralf Heese. [en_NZ]
dc.subject: computer science [en_NZ]
dc.title: Text categorization and similarity analysis: similarity measure, architecture and design [en_NZ]
dc.type: Working Paper [en_NZ]
dspace.entity.type: Publication
uow.relation.series: 12/2013 [en_NZ]

Files

Original bundle

Name: uow-cs-wp-2013-12.pdf
Size: 504.83 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission