
N-gram models of agreement in language

Conventional n-gram language models are well established as powerful yet simple mechanisms for characterising language structure when low data complexity is the primary objective. Much of their predictive power can be traced to a relatively small number of common word sequences, usually composed of grammatical terms, and a large number of infrequent word patterns composed of thematic terms with high mutual information. The drawback of conventional approaches is an exceedingly large number of other n-grams, which waste probability mass without making a reciprocal contribution to the formulation of accurate probability estimates. This thesis describes a simple modification to the n-gram approach which attempts to preserve and enhance the most useful characteristics of conventional models while mitigating their weaknesses by eradicating low-utility contexts.

If one divides the vocabulary of a language into two broad classes, one composed solely of content words (nouns, verbs, adjectives, etc.) and the other of grammatical words (determiners, prepositions, modal auxiliaries, etc.), then language can be viewed as the interlacing of two lexical streams: a content word sequence and a grammatical word sequence. Two words are said to be “super-adjacent” if they are next to each other in one of the two streams. It is shown that an n-gram model of super-adjacent terms is better able to exploit the high mutual information of semantic words in close proximity and the strong syntactic dependencies exhibited in patterns of grammatical words, while many low-utility n-grams that include words from both classes are eliminated. In addition, by reducing regularly inflected words to their base forms and moving inflectional suffixes to the grammatical stream, large numbers of low-frequency content bigrams are collapsed into far fewer, more general cases, and morphological agreement is made accessible in abstraction.

The result is a more compact model that gives better complexity estimates than is possible from the conventional approach.
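
The stream-splitting idea can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the word list `GRAMMATICAL`, the lookup table `INFLECTED`, and the function names `split_streams` and `super_adjacent_bigrams` are hypothetical, chosen only for illustration. The sketch also assumes that a moved suffix is appended to the grammatical stream at the position where its word occurs, which the abstract does not specify.

```python
from collections import Counter

# Hypothetical, deliberately tiny closed-class word list (an assumption).
GRAMMATICAL = {"the", "a", "of", "in", "on", "to", "will", "may"}

# Hypothetical lookup of regularly inflected forms: word -> (base form, suffix).
INFLECTED = {
    "dogs": ("dog", "-s"),
    "cats": ("cat", "-s"),
    "chased": ("chase", "-ed"),
}


def split_streams(tokens):
    """Split a token sequence into a content stream and a grammatical stream.

    Inflected content words are reduced to their base form, and the suffix
    is placed in the grammatical stream (placement is an assumption).
    """
    content, grammatical = [], []
    for tok in tokens:
        tok = tok.lower()
        if tok in GRAMMATICAL:
            grammatical.append(tok)
        elif tok in INFLECTED:
            base, suffix = INFLECTED[tok]
            content.append(base)
            grammatical.append(suffix)
        else:
            content.append(tok)
    return content, grammatical


def super_adjacent_bigrams(tokens):
    """Count bigrams of words that are neighbours within the same stream."""
    content, grammatical = split_streams(tokens)
    content_counts = Counter(zip(content, content[1:]))
    grammatical_counts = Counter(zip(grammatical, grammatical[1:]))
    return content_counts, grammatical_counts


if __name__ == "__main__":
    sentence = "the dogs chased a cat in the garden".split()
    content_counts, grammatical_counts = super_adjacent_bigrams(sentence)
    print(sorted(content_counts))
    print(sorted(grammatical_counts))
```

In this toy sentence, "dog" and "chase" are super-adjacent even though a suffix separates them in the surface text, and no bigram mixes a content word with a grammatical word.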
The University of Waikato
All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.