Research Commons

      Diacritic Restoration and the Development of a Part-of-Speech Tagset for the Māori Language

      Cocks, John
      Files
      thesis.pdf
      1.534Mb
      Citation
      Cocks, J. (2012). Diacritic Restoration and the Development of a Part-of-Speech Tagset for the Māori Language (Thesis, Master of Science (MSc)). University of Waikato, Hamilton, New Zealand. Retrieved from https://hdl.handle.net/10289/6483
      Permanent Research Commons link: https://hdl.handle.net/10289/6483
      Abstract
      This thesis investigates two fundamental problems in natural language processing: diacritic restoration and part-of-speech tagging. Over the past three decades, statistical approaches to both problems have attracted growing interest as manually annotated training data has become increasingly available for major languages such as English and French. However, these approaches are impractical for most minority languages, for which appropriate training data is either non-existent or not publicly available. Furthermore, a part-of-speech tagging system cannot be developed for a language until a suitable tagset exists for it. In this thesis, we make the following contributions to bridge this gap:

      Firstly, we propose a method for diacritic restoration based on naive Bayes classifiers that operate at the word level. Classifications are based on a rich set of features extracted automatically from training data in the form of diacritically marked text. Because the method requires no additional resources, it is language independent. The algorithm was evaluated on one language, Māori, where it achieved an accuracy exceeding 99%.
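
      As a rough illustration of what word-level naive Bayes restoration can look like, the Python sketch below trains one classifier per diacritic-free word form from diacritically marked text alone. It is not the thesis's implementation: the class name WordLevelRestorer, the helper functions, the two-feature context (previous and next word), the add-one smoothing, and the toy training phrases are all assumptions made for this example, and the thesis's actual feature set is richer.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove combining marks (e.g. macrons) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def context_features(words, i):
    """Hypothetical feature set: the diacritic-free neighbouring words."""
    prev_w = words[i - 1] if i > 0 else "<s>"
    next_w = words[i + 1] if i + 1 < len(words) else "</s>"
    return ["prev=" + prev_w, "next=" + next_w]


class WordLevelRestorer:
    """One naive Bayes classifier per diacritic-free word form (a sketch)."""

    def __init__(self):
        # stripped form -> counts of the marked forms seen for it
        self.class_counts = defaultdict(Counter)
        # (stripped form, marked form) -> counts of context features
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, sentences):
        """Learn from diacritically marked sentences; no other resource is used."""
        for sentence in sentences:
            marked = unicodedata.normalize("NFC", sentence.lower()).split()
            plain = [strip_diacritics(w) for w in marked]
            for i, (m, p) in enumerate(zip(marked, plain)):
                self.class_counts[p][m] += 1
                for f in context_features(plain, i):
                    self.feature_counts[(p, m)][f] += 1
                    self.vocab.add(f)

    def restore_word(self, plain, i):
        """Pick the marked form with the highest naive Bayes log score."""
        base = plain[i]
        candidates = self.class_counts.get(base)
        if not candidates:
            return base  # unseen word: leave it unchanged
        total = sum(candidates.values())
        best, best_score = base, -math.inf
        for cand, count in candidates.items():
            score = math.log(count / total)  # log prior of this marked form
            fc = self.feature_counts[(base, cand)]
            denom = sum(fc.values()) + len(self.vocab)
            for f in context_features(plain, i):
                score += math.log((fc[f] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best, best_score = cand, score
        return best

    def restore(self, text):
        """Restore diacritics in whitespace-tokenised, diacritic-free text."""
        plain = [strip_diacritics(w) for w in text.lower().split()]
        return " ".join(self.restore_word(plain, i) for i in range(len(plain)))


# Toy training data (illustrative only): singular "wahine" vs plural "wāhine".
restorer = WordLevelRestorer()
restorer.train(["te wahine", "ngā wāhine"])
print(restorer.restore("nga wahine"))
print(restorer.restore("te wahine"))
```

      With this toy data the only ambiguous form is "wahine", and the preceding word decides between the two candidates. A realistic evaluation would need a large marked corpus and held-out test text.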

      Secondly, we present our work on creating one of the resources needed to develop a part-of-speech tagging system for Māori: a suitable tagset. The tagset was developed in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora and is the result of an in-depth analysis of Māori grammar.
      Date
      2012
      Type
      Thesis
      Degree Name
      Master of Science (MSc)
      Supervisors
      Keegan, Te Taka
      Publisher
      University of Waikato
      Rights
      All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.
      Collections
      • Masters Degree Theses [2387]