Compressing semi-structured text using hierarchical phrase identification

dc.contributor.author: Nevill-Manning, Craig G.
dc.contributor.author: Witten, Ian H.
dc.contributor.author: Olsen, Dan R., Jr.
dc.date.accessioned: 2010-12-02T21:15:14Z
dc.date.available: 2010-12-02T21:15:14Z
dc.date.issued: 1996
dc.description.abstract: Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure. [en_NZ]
dc.format.mimetype: application/pdf
dc.identifier.citation: Nevill-Manning, C.G., Witten, I.H. & Olsen, D.R., Jr. (1996). Compressing semi-structured text using hierarchical phrase identification. In Data Compression Conference (DCC '96), Snowbird, Utah, March 31-April 3, 1996 (pp. 63-72). California, USA: IEEE Computer Society Press. [en_NZ]
dc.identifier.doi: 10.1109/DCC.1996.488311 [en_NZ]
dc.identifier.uri: https://hdl.handle.net/10289/4835
dc.language.iso: en
dc.publisher: IEEE Computer Society Press [en_NZ]
dc.relation.uri: http://www.computer.org/portal/web/csdl/doi/10.1109/DCC.1996.488311 [en_NZ]
dc.rights: ©1996 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. [en_NZ]
dc.subject: computer science [en_NZ]
dc.subject: compressing [en_NZ]
dc.subject: Machine learning
dc.title: Compressing semi-structured text using hierarchical phrase identification [en_NZ]
dc.type: Conference Contribution [en_NZ]
dspace.entity.type: Publication

Files

Original bundle

Name: Compressing.pdf
Size: 1.01 MB
Format: Adobe Portable Document Format
