Permanent link to Research Commons version: https://hdl.handle.net/10289/16390
Summary form only given. Text categorization is the assignment of natural language texts to predefined categories based on their content. The use of predefined categories implies a "supervised learning" approach to categorization, where already-classified articles, which effectively define the categories, are used as "training data" to build a model that can then classify new articles, the "test data". Typical approaches extract features from articles and use the feature vectors as input to a machine learning scheme that learns how to classify articles. The features are generally words. Compression has often been observed to offer a promising alternative approach to categorization: the overall compression of an article with respect to different category models can be compared to see which model it fits most closely. Such a scheme has several potential advantages: it yields an overall judgement on the document as a whole, rather than discarding information by pre-selecting features; it avoids the messy and rather artificial problem of defining word boundaries; it deals uniformly with morphological variants of words; depending on the model (and its order), it can take account of phrasal effects that span word boundaries; it offers a uniform way of dealing with different types of documents (for example, arbitrary files in a computer system); and it generally minimizes the arbitrary decisions that inevitably need to be taken to render any learning scheme practical.
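The comparison of an article's compression under different category models can be illustrated with a minimal sketch. It rests on assumptions that are not from the abstract: Python's standard `zlib` (DEFLATE) stands in for the compression model, although it is a dictionary coder rather than a context model with an "order"; each category's "model" is approximated by concatenating its training text with the article; and the names `compressed_size`, `extra_bytes`, and `classify` are hypothetical.

```python
import zlib
from typing import Dict


def compressed_size(data: bytes) -> int:
    """Number of bytes zlib needs for `data` at maximum compression level."""
    return len(zlib.compress(data, 9))


def extra_bytes(training: bytes, document: bytes) -> int:
    """Approximate cost of coding `document` after the compressor has seen `training`.

    Assumption: the growth in compressed size when the document is appended to
    the training text is used as a proxy for compression under a category model.
    """
    return compressed_size(training + b" " + document) - compressed_size(training)


def classify(document: str, training_by_category: Dict[str, str]) -> str:
    """Return the category whose training text makes the document cheapest to code."""
    doc = document.encode("utf-8")
    costs = {
        category: extra_bytes(text.encode("utf-8"), doc)
        for category, text in training_by_category.items()
    }
    return min(costs, key=costs.get)


# Toy training data, invented purely for illustration.
training = {
    "grain": "wheat harvest exports rose as farmers shipped corn and barley "
             "to grain markets. the wheat crop and corn crop were large this season.",
    "money": "the central bank raised interest rates as the currency fell. "
             "bond yields and interest rates moved with the money supply.",
}

print(classify("farmers shipped corn and barley to grain markets", training))
```

The sketch works on raw bytes, so it needs no word-boundary detection or feature selection, which mirrors the advantages listed above. Note that zlib only looks back 32 KB, so a realistic training corpus would need a compressor that can actually model the whole category.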
This is an author’s accepted version of a conference paper published in: Proc Data Compression Conference. © 2000 IEEE.