
      N-gram models of agreement in language

      Smith, Anthony Clive
Files
thesis.pdf (10.96 MB)
      Permanent link to Research Commons version
      https://hdl.handle.net/10289/14889
      Abstract
Conventional n-gram language models are well established as powerful yet simple mechanisms for characterising language structure when low data complexity is the primary objective. Much of their predictive power can be traced to a relatively small number of common word sequences, usually composed of grammatical terms, and a large number of infrequent word patterns composed of thematic terms with high mutual information. The drawback of conventional approaches is an exceedingly large number of other n-grams that waste probability mass without making a reciprocal contribution to the formulation of accurate probability estimates.
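By way of background, the sketch below shows how a conventional bigram model turns raw counts into maximum-likelihood probability estimates. The toy corpus and word choices are invented for illustration and are not drawn from the thesis.

    # A minimal sketch of a conventional bigram model with maximum-likelihood
    # estimates. The toy corpus is invented purely for illustration.
    from collections import Counter

    def train_bigram(sentences):
        unigrams, bigrams = Counter(), Counter()
        for words in sentences:
            padded = ["<s>"] + words + ["</s>"]
            unigrams.update(padded)
            bigrams.update(zip(padded, padded[1:]))
        return unigrams, bigrams

    def bigram_prob(unigrams, bigrams, w1, w2):
        # P(w2 | w1) = count(w1, w2) / count(w1); zero for an unseen history
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
    ]
    uni, bi = train_bigram(corpus)
    print(bigram_prob(uni, bi, "the", "cat"))  # 0.25
    print(bigram_prob(uni, bi, "sat", "on"))   # 1.0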

This thesis describes a simple modification to the n-gram approach which attempts to preserve and enhance the most useful characteristics of conventional models while mitigating their weaknesses by eradicating low-utility contexts. If one divides the vocabulary of a language into two broad classes - one consisting solely of content words (nouns, verbs, adjectives, etc.) and the other of grammatical words (determiners, prepositions, modal auxiliaries, etc.) - then language can be viewed as the interlacing of two lexical streams: a content word sequence and a grammatical word sequence. Two words are said to be “super-adjacent” if they are next to each other in one of the two streams.
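The two-stream view can be made concrete with a small sketch: split a sentence into its grammatical and content streams, then pair up neighbours within each stream to obtain the super-adjacent pairs. The short grammatical-word list used here is an assumption made for the example, not the word classification used in the thesis.

    # Illustrative split of a sentence into grammatical and content streams,
    # followed by the super-adjacent pairs (neighbours within each stream).
    # The grammatical-word list below is a toy assumption for this example.
    GRAMMATICAL = {"the", "a", "an", "of", "on", "in", "to", "is", "was",
                   "will", "and", "that"}

    def split_streams(words):
        grammatical = [w for w in words if w.lower() in GRAMMATICAL]
        content = [w for w in words if w.lower() not in GRAMMATICAL]
        return grammatical, content

    def super_adjacent_pairs(words):
        g, c = split_streams(words)
        return list(zip(g, g[1:])) + list(zip(c, c[1:]))

    sentence = "the old dog slept on the new mat".split()
    g, c = split_streams(sentence)
    print(g)  # ['the', 'on', 'the']
    print(c)  # ['old', 'dog', 'slept', 'new', 'mat']
    print(super_adjacent_pairs(sentence))
    # [('the', 'on'), ('on', 'the'),
    #  ('old', 'dog'), ('dog', 'slept'), ('slept', 'new'), ('new', 'mat')]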

It is shown that an n-gram model of super-adjacent terms is better able to exploit the high mutual information of semantic words in close proximity and the strong syntactic dependencies exhibited in patterns of grammatical words, while many low-utility n-grams that include words from both classes are eliminated. In addition, by reducing regularly inflected words to their base forms and moving inflectional suffixes to the grammatical stream, large numbers of low-frequency content bigrams are collapsed into far fewer, more general cases, and morphological agreement is made accessible in abstraction. The result is a more compact model that gives better complexity estimates than is possible with the conventional approach.
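The morphological step can likewise be sketched with a few crude suffix rules, assumed here purely for illustration (the thesis's actual handling of inflection may differ): stems remain in the content stream, while the stripped suffixes become tokens routed to the grammatical stream.

    # Sketch of the morphological step: strip a regular inflectional suffix
    # from a content word, keep the stem in the content stream, and emit the
    # suffix as a token for the grammatical stream. The suffix rules below
    # are deliberately naive and are assumptions for this example only.
    SUFFIXES = ("ing", "ed", "s")

    def split_inflection(word):
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) > len(suf) + 2:
                return word[:-len(suf)], "+" + suf
        return word, None

    def normalise_content_stream(content_words):
        stems, suffix_tokens = [], []
        for w in content_words:
            stem, suf = split_inflection(w)
            stems.append(stem)
            if suf is not None:
                suffix_tokens.append(suf)  # routed to the grammatical stream
        return stems, suffix_tokens

    print(normalise_content_stream(["dogs", "barked", "barking", "mat"]))
    # (['dog', 'bark', 'bark', 'mat'], ['+s', '+ed', '+ing'])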
      Date
      2000
      Type
      Thesis
      Degree Name
      Doctor of Philosophy (PhD)
      Supervisors
      Witten, Ian H.
      Holmes, Geoffrey
      Cleary, John G.
      Publisher
      The University of Waikato
      Rights
      All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.
      Collections
      • Higher Degree Theses [1578]