Research Commons
      • Browse 
        • Communities & Collections
        • Titles
        • Authors
        • By Issue Date
        • Subjects
        • Types
        • Series
      • Help 
        • About
        • Collection Policy
        • OA Mandate Guidelines
        • Guidelines FAQ
        • Contact Us
      • My Account 
        • Sign In
        • Register
      View Item 
      •   Research Commons
      • University of Waikato Theses
      • Higher Degree Theses
      • View Item
      •   Research Commons
      • University of Waikato Theses
      • Higher Degree Theses
      • View Item
      JavaScript is disabled for your browser. Some features of this site may not work without it.

      Text Augmentation: Inserting markup into natural language text with PPM Models

      Yeates, Stuart Andrew
      Thumbnail
      Files
      thesis.pdf
      2.068Mb
      Citation
      Export citation
      Yeates, S. A. (2006). Text Augmentation: Inserting markup into natural language text with PPM Models (Thesis, Doctor of Philosophy (PhD)). The University of Waikato, Hamilton, New Zealand. Retrieved from https://hdl.handle.net/10289/2600
      Permanent Research Commons link: https://hdl.handle.net/10289/2600
      Abstract
      This thesis describes a new optimisation and new heuristics for automatically marking up XML documents. These are implemented in CEM, using PPMmodels. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n and a variety of escape methods.

      Four corpora are discussed, including the bibliography corpus of 14682 bibliographies laid out in seven standard styles using the BIBTEX system and markedup in XML with every field from the original BIBTEX. Other corpora include the ROCLING Chinese text segmentation corpus, the Computists’ Communique corpus and the Reuters’ corpus. A detailed examination is presented of the methods of evaluating mark up algorithms, including computation complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory.

      A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.
      Date
      2006
      Type
      Thesis
      Degree Name
      Doctor of Philosophy (PhD)
      Publisher
      The University of Waikato
      Rights
      All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.
      Collections
      • Higher Degree Theses [1714]
      Show full item record  

      Usage

      Downloads, last 12 months
      47
       
       

      Usage Statistics

      For this itemFor all of Research Commons

      The University of Waikato - Te Whare Wānanga o WaikatoFeedback and RequestsCopyright and Legal Statement