Concept-based Text Clustering

Authors

Huang, Lan

Files

thesis.pdf (2.26 MB)

Permanent Link

https://hdl.handle.net/10289/5476

Abstract

Thematic organization of text is a natural practice of humans and a crucial task for today's vast repositories. Clustering automates this by assessing the similarity between texts and organizing them accordingly, grouping like ones together and separating those with different topics. Clusters provide a comprehensive logical structure that facilitates exploration, search and interpretation of current texts, as well as organization of future ones. Automatic clustering is usually based on words. Text is represented by the words it mentions, and thematic similarity is based on the proportion of words that texts have in common. The resulting bag-of-words model is semantically ambiguous and undesirably orthogonal: it ignores the connections between words. This thesis claims that using concepts as the basis of clustering can significantly improve effectiveness. Concepts are defined as units of knowledge. When organized according to the relations among them, they form a concept system. Two concept systems are used here: WordNet, which focuses on word knowledge, and Wikipedia, which encompasses world knowledge. We investigate a clustering procedure with three components: using concepts to represent text; taking the semantic relations among them into account during clustering; and learning a text similarity measure from concepts and their relations. First, we demonstrate that concepts provide a succinct and informative representation of the themes in text, exemplifying this with the two concept systems. Second, we define methods for utilizing concept relations to enhance clustering by making the representation models more discriminative and extending thematic similarity beyond surface overlap. Third, we present a similarity measure based on concepts and their relations that is learned from a small number of examples, and show that it both predicts similarity consistently with human judgement and improves clustering.
The thesis provides strong support for the use of concept-based representations instead of the classic bag-of-words model.

Citation

Huang, L. (2011). Concept-based Text Clustering (Thesis, Doctor of Philosophy (PhD)). University of Waikato, Hamilton, New Zealand. Retrieved from https://hdl.handle.net/10289/5476

Type

Thesis

Date

2011

Publisher

University of Waikato

Degree

Doctor of Philosophy (PhD)

Supervisor

Witten, Ian H.
Frank, Eibe
