Fast convergence with a greedy tag-phrase dictionary

Peeters, Ross; Smith, Tony C.

Abstract

The best general-purpose compression schemes make their gains by estimating a probability distribution over all possible next symbols given the context established by some number of previous symbols. Such context models typically obtain good compression results for plain text by taking advantage of regularities in character sequences. Frequent words and syllables can be incorporated into the model quickly and thereafter used for reasonably accurate prediction. However, the precise context in which frequent patterns emerge is often extremely varied, and each new word or phrase immediately introduces new contexts which can adversely affect the compression rate. Much of the structural regularity in a natural language stems from properties of its grammar rather than from the orthographic transcription of its phonology. This implies that access to a grammatical abstraction might lead to good compression. While grammatical models have been used successfully for compressing computer programs [4], grammar-based compression of plain text has received little attention, primarily because of the difficulties associated with constructing a suitable natural language grammar. But even without a precise formulation of the syntax of a language, there is a linguistic abstraction that is easily accessed and highly regular, and that can therefore be exploited for compression: lexical categories.

Type

Working Paper

Series

Computer Science Working Papers

Citation

Peeters, R. & Smith, T.C. (1997). Fast convergence with a greedy tag-phrase dictionary. (Working paper 97/23). Hamilton, New Zealand: University of Waikato, Department of Computer Science.

Date

1997-11
