Research Commons

      Extracting text from PostScript

      Nevill-Manning, Craig G.; Reed, Todd; Witten, Ian H.
      Files
      uow-cs-wp-1997-10.pdf
      204.9Kb
      Citation
      Nevill-Manning, C. G., Reed, T., & Witten, I. H. (1997). Extracting text from PostScript. (Working paper 97/10). Hamilton, New Zealand: University of Waikato, Department of Computer Science.
      Permanent Research Commons link: https://hdl.handle.net/10289/1073
      Abstract
      We show how to extract plain text from PostScript files. A textual scan is inadequate because PostScript interpreters can generate characters on the page that do not appear in the source file. Furthermore, word and line breaks are implicit in the graphical rendition, and must be inferred from the positioning of word fragments. We present a robust technique for extracting text and recognizing words and paragraphs. The method uses a standard PostScript interpreter but redefines several PostScript operators, and simple heuristics are employed to locate word and line breaks. The scheme has been used to create a full-text index, and plain-text versions, of 40,000 technical reports (34 Gbyte of PostScript). Other text-extraction systems are reviewed: none offer the same combination of robustness and simplicity.
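
The idea of inferring word and line breaks from the positions of word fragments can be illustrated with a minimal sketch. It is not the authors' implementation: the `Fragment` record, the `join_fragments` function, and the `space_gap` and `line_gap` thresholds are all hypothetical names and values chosen for illustration. The sketch assumes that an interpreter with redefined text-drawing operators has already reported each rendered fragment together with its position and width, in reading order.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One piece of text as drawn on the page (hypothetical record)."""
    text: str
    x: float      # left edge of the fragment
    y: float      # baseline position
    width: float  # rendered width of the fragment


def join_fragments(fragments, space_gap=2.0, line_gap=5.0):
    """Rebuild lines of text from positioned fragments.

    Assumes fragments arrive in reading order. The thresholds are
    illustrative guesses, not values taken from the paper.
    """
    lines = []
    current = ""
    prev = None
    for frag in fragments:
        if prev is None:
            current = frag.text
        elif abs(frag.y - prev.y) > line_gap:
            # Baseline moved noticeably: treat it as a line break.
            lines.append(current)
            current = frag.text
        else:
            # Same line: a visible horizontal gap implies a word break,
            # otherwise the fragment continues the current word.
            gap = frag.x - (prev.x + prev.width)
            if gap > space_gap:
                current += " "
            current += frag.text
        prev = frag
    if prev is not None:
        lines.append(current)
    return lines


# Made-up fragments: "Extrac" and "ting" abut, so they form one word.
demo = [
    Fragment("Extrac", 72, 700, 30),
    Fragment("ting", 102, 700, 20),
    Fragment("text", 127, 700, 18),
    Fragment("from", 72, 688, 22),
    Fragment("PostScript", 99, 688, 48),
]
result = join_fragments(demo)
```

Fixed thresholds keep the sketch short. A practical system would likely scale them with the font size in use, and would need further rules for hyphenated line ends and paragraph boundaries, which this sketch does not attempt.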
      Date
      1997-04
      Type
      Working Paper
      Series
      Computer Science Working Papers
      Report No.
      97/10
      Publisher
      Department of Computer Science, University of Waikato
      Collections
      • 1997 Working Papers [31]
