Publication:
A compression-based algorithm for Chinese word segmentation

dc.contributor.author: Teahan, W.J.
dc.contributor.author: Wen, Yingying
dc.contributor.author: McNab, Rodger J.
dc.contributor.author: Witten, Ian H.
dc.date.accessioned: 2008-10-17T03:48:37Z
dc.date.available: 2008-10-17T03:48:37Z
dc.date.issued: 1999-09
dc.description.abstract: The Chinese language is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression. It is trained on a corpus of pre-segmented text, and when applied to new text, interpolates word boundaries so as to maximize the compression obtained. This simple and general method performs well with respect to specialized schemes for Chinese language segmentation. [en_US]
dc.format.mimetype: application/pdf
dc.identifier.citation: Teahan, W.J., Wen, Y., McNab, R. & Witten, I.H. (1999). A compression-based algorithm for Chinese word segmentation. (Working paper 99/13). Hamilton, New Zealand: University of Waikato, Department of Computer Science. [en_US]
dc.identifier.issn: 1170-487X
dc.identifier.uri: https://hdl.handle.net/10289/1042
dc.language.iso: en
dc.publisher: Computer Science, University of Waikato [en_NZ]
dc.relation.ispartofseries: Computer Science Working Papers
dc.subject: chinese segmentation [en_US]
dc.subject: language models [en_US]
dc.subject: text compression [en_US]
dc.subject: statistical models [en_US]
dc.subject: text mining [en_US]
dc.subject: Machine learning
dc.title: A compression-based algorithm for Chinese word segmentation [en_US]
dc.type: Working Paper [en_US]
dspace.entity.type: Publication
pubs.begin-page: 374 [en_NZ]
pubs.end-page: 393 [en_NZ]
pubs.place-of-publication: Hamilton [en_NZ]
uow.relation.series: 99/13

Files

Original bundle

Name: uow-cs-wp-1999-13.pdf
Size: 1.47 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.8 KB
Format: Item-specific license agreed upon to submission