Text Augmentation: Inserting markup into natural language text with PPM Models

Authors

Yeates, Stuart Andrew

Files

thesis.pdf (2.07 MB)

Permanent Link

https://hdl.handle.net/10289/2600

Abstract

This thesis describes a new optimisation and new heuristics for automatically marking up XML documents. These are implemented in CEM, using PPM models. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n and a variety of escape methods. Four corpora are discussed, including the bibliography corpus of 14682 bibliographies laid out in seven standard styles using the BibTeX system and marked up in XML with every field from the original BibTeX. Other corpora include the ROCLING Chinese text segmentation corpus, the Computists’ Communique corpus and the Reuters’ corpus. A detailed examination is presented of the methods of evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory. A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.

Citation

Yeates, S. A. (2006). Text Augmentation: Inserting markup into natural language text with PPM Models (Thesis, Doctor of Philosophy (PhD)). The University of Waikato, Hamilton, New Zealand. Retrieved from https://hdl.handle.net/10289/2600

Type

Thesis

Date

2006

Publisher

The University of Waikato

Degree

Doctor of Philosophy (PhD)
