N-gram models of agreement in language

Authors

Smith, Anthony Clive

Files

thesis.pdf (10.97 MB)

Permanent Link

https://hdl.handle.net/10289/14889

Abstract

Conventional n-gram language models are well established as powerful yet simple mechanisms for characterising language structure when low data complexity is the primary objective. Much of their predictive power can be traced to a relatively small number of common word sequences, usually composed of grammatical terms, and a large number of infrequent word patterns composed of thematic terms with high mutual information. The drawback of conventional approaches is an exceedingly large number of other n-grams that waste probability mass without making a reciprocal contribution to the formulation of accurate probability estimates. This thesis describes a simple modification to the n-gram approach which attempts to preserve and enhance the most useful characteristics of conventional models while mitigating their weaknesses by eradicating low-utility contexts. If one divides the vocabulary of a language into two broad classes - one composed solely of content words (nouns, verbs, adjectives, etc.) and the other of grammatical words (determiners, prepositions, modal auxiliaries, etc.) - then language can be viewed as the interlacing of two lexical streams: a content word sequence and a grammatical word sequence. Two words are said to be “super-adjacent” if they are next to each other in one of the two streams. It is shown that an n-gram model of super-adjacent terms is better able to exploit the high mutual information of close-proximity semantic words and the strong syntactic dependencies exhibited in patterns of grammatical words, while many low-utility n-grams that include words from both classes are eliminated. In addition, by reducing regularly inflected words to their base forms and moving inflectional suffixes to the grammatical stream, large numbers of low-frequency content bigrams are collapsed into far fewer, more general cases, and morphological agreement is made accessible in abstraction.
The result is a more compact model that gives better complexity estimates than is possible from the conventional approach.
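
The two-stream idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the grammatical word list, the function names `split_streams` and `super_adjacent_bigrams`, and the example sentence are all assumptions made for illustration, and the inflectional suffix handling described in the abstract is not attempted.

```python
# Hypothetical, deliberately tiny list of grammatical (closed-class) words.
# A real model would need a far more complete lexicon.
GRAMMATICAL_WORDS = {"the", "a", "an", "of", "in", "on", "to", "will", "can", "must"}


def split_streams(tokens):
    """Split tokens into (content_stream, grammatical_stream), keeping word order."""
    content, grammatical = [], []
    for tok in tokens:
        word = tok.lower()
        if word in GRAMMATICAL_WORDS:
            grammatical.append(word)
        else:
            content.append(word)
    return content, grammatical


def super_adjacent_bigrams(stream):
    """Pair each word with its successor in the same stream."""
    return list(zip(stream, stream[1:]))


tokens = "the dog will chase a cat in the garden".split()
content, grammatical = split_streams(tokens)

content_bigrams = super_adjacent_bigrams(content)
grammatical_bigrams = super_adjacent_bigrams(grammatical)

print(content_bigrams)
print(grammatical_bigrams)
```

In this toy sentence, "dog" and "chase" are super-adjacent even though "will" separates them in the surface text, while mixed-class pairs such as ("the", "dog") never appear in either bigram list.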

Type

Thesis

Date

2000

Publisher

The University of Waikato

Degree

Doctor of Philosophy (PhD)

Supervisor

Witten, Ian H.
Holmes, Geoffrey
Cleary, John G.
