Compressing semi-structured text using hierarchical phrase identification

Nevill-Manning, Craig G.; Witten, Ian H.; Olsen, Dan R., Jr.

doi:10.1109/DCC.1996.488311

Authors

Nevill-Manning, Craig G.

Witten, Ian H.

Olsen, Dan R., Jr.

Files

Compressing.pdf (1.01 MB)

Permanent Link

https://hdl.handle.net/10289/4835

DOI

10.1109/DCC.1996.488311

Rights

©1996 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Abstract

Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.

Citation

Nevill-Manning, C.G., Witten, I.H. & Olsen, D.R., Jr. (1996). Compressing semi-structured text using hierarchical phrase identification. In Data Compression Conference (DCC ’96), Snowbird, Utah, March 31-April 3, 1996 (pp. 63-72). California, USA: IEEE Computer Society Press.

Type

Conference Contribution

Date

1996

Publisher

IEEE Computer Society Press
