Publication: A compression-based algorithm for Chinese word segmentation
| dc.contributor.author | Teahan, W.J. | |
| dc.contributor.author | Wen, Yingying | |
| dc.contributor.author | McNab, Rodger J. | |
| dc.contributor.author | Witten, Ian H. | |
| dc.date.accessioned | 2008-10-17T03:48:37Z | |
| dc.date.available | 2008-10-17T03:48:37Z | |
| dc.date.issued | 1999-09 | |
| dc.description.abstract | The Chinese language is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression. It is trained on a corpus of pre-segmented text, and when applied to new text, interpolates word boundaries so as to maximize the compression obtained. This simple and general method performs well with respect to specialized schemes for Chinese language segmentation. | en_US |
| dc.format.mimetype | application/pdf | |
| dc.identifier.citation | Teahan, W.J., Wen, Y., McNab, R.J. & Witten, I.H. (1999). A compression-based algorithm for Chinese word segmentation. (Working paper 99/13). Hamilton, New Zealand: University of Waikato, Department of Computer Science. | en_US |
| dc.identifier.issn | 1170-487X | |
| dc.identifier.uri | https://hdl.handle.net/10289/1042 | |
| dc.language.iso | en | |
| dc.publisher | Computer Science, University of Waikato | en_NZ |
| dc.relation.ispartofseries | Computer Science Working Papers | |
| dc.subject | Chinese segmentation | en_US |
| dc.subject | language models | en_US |
| dc.subject | text compression | en_US |
| dc.subject | statistical models | en_US |
| dc.subject | text mining | en_US |
| dc.subject | Machine learning | |
| dc.title | A compression-based algorithm for Chinese word segmentation | en_US |
| dc.type | Working Paper | en_US |
| dspace.entity.type | Publication | |
| pubs.begin-page | 374 | en_NZ |
| pubs.end-page | 393 | en_NZ |
| pubs.place-of-publication | Hamilton | en_NZ |
| uow.relation.series | 99/13 | |
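
The idea in the abstract, training a model on pre-segmented text and then placing boundaries in new text wherever they reduce the coded length, can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed order-1 character model with add-one smoothing is an assumed stand-in for the adaptive compression model the abstract mentions, and the names `BigramModel`, `cost`, and `segment` are hypothetical. With an order-1 model the choice at each gap depends only on the two neighbouring characters, so deciding each gap independently already minimizes the total cost; a higher-order model would need a search over boundary sequences.

```python
import math


class BigramModel:
    """Order-1 character model with add-one smoothing.

    Assumption: a simple stand-in for the adaptive compression model
    described in the abstract, not the authors' actual model.
    """

    def __init__(self, segmented_text):
        # counts[prev][ch] = how often ch followed prev in the training text
        self.counts = {}
        self.totals = {}
        self.alphabet = set(segmented_text) | {" "}
        prev = " "  # treat the start of the text as following a boundary
        for ch in segmented_text:
            row = self.counts.setdefault(prev, {})
            row[ch] = row.get(ch, 0) + 1
            self.totals[prev] = self.totals.get(prev, 0) + 1
            prev = ch

    def cost(self, prev, ch):
        """Bits needed to code ch after prev under the smoothed model."""
        vocab = len(self.alphabet) + 1  # one extra slot for unseen characters
        seen = self.counts.get(prev, {}).get(ch, 0)
        p = (seen + 1) / (self.totals.get(prev, 0) + vocab)
        return -math.log2(p)


def segment(model, text):
    """Insert a space at each gap where doing so lowers the coded length."""
    out = []
    prev = " "
    for ch in text:
        if out:
            joined = model.cost(prev, ch)
            split = model.cost(prev, " ") + model.cost(" ", ch)
            if split < joined:
                out.append(" ")
        out.append(ch)
        prev = ch
    return "".join(out)


# Toy training corpus: words separated by spaces (pre-segmented text).
model = BigramModel("中国 人民 中国 人民 中国 人民 中国 人民")
print(segment(model, "中国人民中国"))  # prints: 中国 人民 中国
```

On this toy corpus the model has only ever seen 国 and 民 followed by a space, so coding a space after them is cheaper than coding the next character directly, and the boundaries fall between the two training words.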