Learning a concept-based document similarity measure

dc.contributor.author: Huang, Lan
dc.contributor.author: Milne, David N.
dc.contributor.author: Frank, Eibe
dc.contributor.author: Witten, Ian H.
dc.date.accessioned: 2012-07-25T04:30:15Z
dc.date.available: 2012-07-25T04:30:15Z
dc.date.issued: 2012
dc.description.abstract[en_NZ]: Document similarity measures are crucial components of many text-analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: They estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine-learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people's judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.
dc.format.mimetype: application/pdf
dc.identifier.citation[en_NZ]: Huang, L., Milne, D.N., Frank, E. & Witten, I.H. (2012). Learning a concept-based document similarity measure. Journal of the American Society for Information Science and Technology, 63(8), 1593-1608.
dc.identifier.doi[en_NZ]: 10.1002/asi.22689
dc.identifier.uri: https://hdl.handle.net/10289/6561
dc.language.iso: en
dc.publisher[en_NZ]: Wiley
dc.relation.isPartOf[en_NZ]: Journal of the American Society for Information Science and Technology
dc.relation.uri[en_NZ]: http://onlinelibrary.wiley.com/doi/10.1002/asi.22689/abstract
dc.rights[en_NZ]: This is a preprint of an article accepted for publication in Journal of the American Society for Information Science and Technology. © 2012 (American Society for Information Science and Technology)
dc.subject[en_NZ]: content analysis
dc.subject[en_NZ]: text mining
dc.subject[en_NZ]: semantic analysis
dc.subject: Machine learning
dc.title[en_NZ]: Learning a concept-based document similarity measure
dc.type[en_NZ]: Journal Article
dspace.entity.type: Publication
pubs.begin-page[en_NZ]: 1593
pubs.edition[en_NZ]: August
pubs.end-page[en_NZ]: 1608
pubs.issue[en_NZ]: 8
pubs.volume[en_NZ]: 63
uow.identifier.article-no[en_NZ]: 8

Files

Original bundle

Name: Learning.pdf
Size: 5.01 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission