
Text categorization using compression models

Abstract
Summary form only given. Text categorization is the assignment of natural language texts to predefined categories based on their content. The use of predefined categories implies a "supervised learning" approach to categorization, where already-classified articles that effectively define the categories are used as "training data" to build a model for classifying the new articles that comprise the "test data". Typical approaches extract features from articles and use the feature vectors as input to a machine learning scheme that learns how to classify articles; the features are generally words. It has often been observed that compression provides a promising alternative approach to categorization: the overall compression of an article with respect to different models can be compared to see which one it fits most closely. Such a scheme has several potential advantages: it yields an overall judgement on the document as a whole, rather than discarding information by pre-selecting features; it avoids the messy and rather artificial problem of defining word boundaries; it deals uniformly with morphological variants of words; depending on the model (and its order), it can take account of phrasal effects that span word boundaries; it offers a uniform way of dealing with different types of documents, for example arbitrary files in a computer system; and it generally minimizes the arbitrary decisions that inevitably need to be taken to render any learning scheme practical.
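The idea sketched in the abstract can be illustrated with a minimal, hedged example. The paper itself uses statistical compression models (character-level models such as PPM); the sketch below substitutes a general-purpose compressor (`zlib`) as a crude stand-in, measuring how many extra bytes it costs to compress a document after each category's training text. The category names, training strings, and the `classify` helper are all hypothetical illustrations, not from the paper.

```python
import zlib

def compressed_size(text: str) -> int:
    # Size in bytes of the zlib-compressed text.
    return len(zlib.compress(text.encode("utf-8"), 9))

def classify(document: str, training: dict) -> str:
    # Assign the category whose training corpus "explains" the
    # document best: the smallest extra compression cost incurred
    # by appending the document to that category's training text.
    best_category, best_cost = None, float("inf")
    for category, corpus in training.items():
        cost = compressed_size(corpus + document) - compressed_size(corpus)
        if cost < best_cost:
            best_category, best_cost = category, cost
    return best_category

# Hypothetical toy corpora for two categories.
training = {
    "weather": "rain wind cloud sunny storm forecast temperature " * 20,
    "finance": "stock market shares profit revenue dividend trading " * 20,
}

print(classify("the forecast says storm and rain with strong wind", training))
```

Note that no features are extracted and no word boundaries are defined: the whole document is judged at once, which is exactly the advantage the abstract claims for compression-based categorization.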
Type
Conference Contribution
Date
2000-03
Publisher
IEEE
Rights
This is an author’s accepted version of a conference paper published in: Proceedings of the Data Compression Conference. © 2000 IEEE.