Research Commons

      A compression-based algorithm for Chinese word segmentation

      Teahan, W.J.; Wen, Yingying; McNab, Rodger J.; Witten, Ian H.
      Files
      uow-cs-wp-1999-13.pdf
      1.474 MB
      Citation
      Teahan, W.J., Wen, Y., McNab, R.J. & Witten, I.H. (1999). A compression-based algorithm for Chinese word segmentation. (Working paper 99/13). Hamilton, New Zealand: University of Waikato, Department of Computer Science.
      Permanent Research Commons link: https://hdl.handle.net/10289/1042
      Abstract
      The Chinese language is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction.

      We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression. It is trained on a corpus of pre-segmented text, and when applied to new text, interpolates word boundaries so as to maximize the compression obtained. This simple and general method performs well with respect to specialized schemes for Chinese language segmentation.
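      The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a static order-1 (previous-character) model with additive smoothing in place of the adaptive compression model the abstract refers to, and the names BigramModel, segment, code_length, the parameter alpha, and the toy corpus are all hypothetical. The model is trained on text whose words are separated by spaces, and a space is inserted wherever doing so shortens the total code length in bits.

```python
import math
from collections import Counter


class BigramModel:
    """Order-1 character model with additive smoothing (an illustrative
    stand-in for an adaptive compression model, not the paper's model)."""

    def __init__(self, segmented_text, alpha=0.01):
        self.alpha = alpha
        # Space is part of the alphabet: it is the word-boundary symbol.
        self.vocab = set(segmented_text) | {" "}
        self.pair_counts = Counter(zip(segmented_text, segmented_text[1:]))
        self.context_counts = Counter(segmented_text[:-1])

    def cost(self, prev, nxt):
        """Code length in bits of `nxt` given the previous character."""
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        p = (self.pair_counts[(prev, nxt)] + self.alpha) / (
            self.context_counts[prev] + self.alpha * v
        )
        return -math.log2(p)


def code_length(model, text):
    """Total bits to code `text` after its first character."""
    return sum(model.cost(a, b) for a, b in zip(text, text[1:]))


def segment(model, text):
    """Insert spaces so that the coded length is minimal.

    With a one-character context each gap can be decided independently;
    a longer context would need a dynamic-programming (Viterbi) search.
    """
    if not text:
        return ""
    out = [text[0]]
    for prev, nxt in zip(text, text[1:]):
        joined = model.cost(prev, nxt)
        split = model.cost(prev, " ") + model.cost(" ", nxt)
        if split < joined:
            out.append(" ")
        out.append(nxt)
    return "".join(out)


# Toy pre-segmented training corpus (hypothetical data).
corpus = "我 喜欢 中文 我 学习 中文 他 喜欢 学习"
model = BigramModel(corpus)
segmented = segment(model, "我喜欢学习中文")
print(segmented)
```

      On this toy corpus the segmented output codes in fewer bits than the unsegmented input, which is the criterion the abstract describes. A realistic system would need a much larger training corpus and a higher-order model.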
      Date
      1999-09
      Type
      Working Paper
      Series
      Computer Science Working Papers
      Report No.
      99/13
      Publisher
      Computer Science, University of Waikato
      Collections
      • 1999 Working Papers [16]