A compression-based algorithm for Chinese word segmentation

Teahan, W.J.; Wen, Yingying; McNab, Rodger J.; Witten, Ian H.

dc.contributor.author	Teahan, W.J.
dc.contributor.author	Wen, Yingying
dc.contributor.author	McNab, Rodger J.
dc.contributor.author	Witten, Ian H.
dc.date.accessioned	2008-10-17T03:48:37Z
dc.date.available	2008-10-17T03:48:37Z
dc.date.issued	1999-09
dc.description.abstract	The Chinese language is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression. It is trained on a corpus of pre-segmented text, and when applied to new text, interpolates word boundaries so as to maximize the compression obtained. This simple and general method performs well with respect to specialized schemes for Chinese language segmentation.	en_US
dc.format.mimetype	application/pdf
dc.identifier.citation	Teahan, W.J., Wen, Y., McNab, R.J. & Witten, I.H. (1999). A compression-based algorithm for Chinese word segmentation. (Working paper 99/13). Hamilton, New Zealand: University of Waikato, Department of Computer Science.	en_US
dc.identifier.issn	1170-487X
dc.identifier.uri	https://hdl.handle.net/10289/1042
dc.language.iso	en
dc.publisher	Computer Science, University of Waikato	en_NZ
dc.relation.ispartofseries	Computer Science Working Papers
dc.subject	chinese segmentation	en_US
dc.subject	language models	en_US
dc.subject	text compression	en_US
dc.subject	statistical models	en_US
dc.subject	text mining	en_US
dc.subject	Machine learning
dc.title	A compression-based algorithm for Chinese word segmentation	en_US
dc.type	Working Paper	en_US
dspace.entity.type	Publication
pubs.begin-page	374	en_NZ
pubs.end-page	393	en_NZ
pubs.place-of-publication	Hamilton	en_NZ
uow.relation.series	99/13
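
The abstract describes training a compression model on pre-segmented text and then placing word boundaries in new text wherever doing so shortens the encoding. The following is only a minimal sketch of that idea, not the paper's implementation: it swaps the adaptive compression model for a fixed order-1 (bigram) character model with add-one smoothing, and the names `BigramModel`, `bits`, `code_length`, and `segment` are hypothetical. Because an order-1 model's total code length decomposes over adjacent character pairs, each gap can be decided independently here; a higher-order model would need a search over whole segmentations instead.

```python
import math
from collections import defaultdict


class BigramModel:
    """Order-1 character model trained on text whose words are space-separated.

    Assumption: a simple stand-in for the adaptive compression model that the
    abstract mentions, chosen only to keep the example short.
    """

    def __init__(self, segmented_lines):
        self.counts = defaultdict(lambda: defaultdict(int))
        alphabet = {" "}
        for line in segmented_lines:
            if not line.strip():
                continue
            prev = " "  # treat the start of a line as following a boundary
            for ch in line.strip() + " ":
                self.counts[prev][ch] += 1
                alphabet.add(ch)
                prev = ch
        # One extra slot reserves probability mass for unseen characters.
        self.vocab = len(alphabet) + 1

    def bits(self, prev, ch):
        """Code length in bits of `ch` following `prev` (add-one smoothing)."""
        ctx = self.counts.get(prev, {})
        total = sum(ctx.values())
        return -math.log2((ctx.get(ch, 0) + 1) / (total + self.vocab))

    def code_length(self, text):
        """Total code length in bits of `text`, starting after a boundary."""
        prev, total = " ", 0.0
        for ch in text:
            total += self.bits(prev, ch)
            prev = ch
        return total


def segment(model, text):
    """Insert a space at every gap where doing so lowers the code length."""
    if not text:
        return ""
    out = [text[0]]
    for a, b in zip(text, text[1:]):
        joined = model.bits(a, b)
        split = model.bits(a, " ") + model.bits(" ", b)
        if split < joined:
            out.append(" ")
        out.append(b)
    return "".join(out)


# Toy training corpus (made up for illustration, not taken from the paper).
model = BigramModel(["中国 人民", "人民 中国", "中国 中国 人民"])
print(segment(model, "中国人民"))  # 中国 人民
```

On this toy corpus the boundary is placed between 国 and 人 because encoding a space there costs fewer bits than encoding 人 directly after 国, which never occurs in the training text.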

Files

Original bundle

Name: uow-cs-wp-1999-13.pdf
Size: 1.47 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.8 KB
Format: Item-specific license agreed upon to submission

Collections

1999 Working Papers
