Research Commons

      Compressing semi-structured text using hierarchical phrase identification

      Nevill-Manning, Craig G.; Witten, Ian H.; Olsen, Dan R., Jr.
      Files
      Compressing.pdf
      1.013Mb
      DOI
       10.1109/DCC.1996.488311
      Link
       www.computer.org
      Citation
      Nevill-Manning, C.G., Witten, I.H. & Olsen, D.R., Jr. (1996). Compressing semi-structured text using hierarchical phrase identification. In Data Compression Conference (DCC ’96), Snowbird, Utah, March 31 - April 3, 1996 (pp. 63-72). California, USA: IEEE Computer Society Press.
      Permanent Research Commons link: https://hdl.handle.net/10289/4835
      Abstract
      Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.
      Date
      1996
      Type
      Conference Contribution
      Publisher
      IEEE Computer Society Press
      Rights
      ©1996 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
      Collections
      • Computing and Mathematical Sciences Papers [1454]

      The University of Waikato - Te Whare Wānanga o Waikato