Smith, T.C. & Lorenz, M.(2001). Better text compression from fewer lexical n-grams. In Proceedings of Data Compression Conference (DCC ‘01). Washington, DC, USA: IEEE Computer Society.
Permanent Research Commons link: http://hdl.handle.net/10289/1722
Word-based context models for text compression have the capacity to outperform more simple character-based models, but are generally unattractive because of inherent problems with exponential model growth and corresponding data sparseness. These ill-effects can be mitigated in an adaptive lossless compression scheme by modelling syntactic and semantic lexical dependencies independently.
IEEE Computer Society
This paper has been published in the Proceedings of Data Compression Conference(DCC ‘01). ©2001 IEEE Computer Society.