Extracting text from PostScript

Nevill-Manning, Craig G.; Reed, Todd; Witten, Ian H.

Authors

Nevill-Manning, Craig G.

Reed, Todd

Witten, Ian H.

Files

uow-cs-wp-1997-10.pdf (204.95 KB)

Permanent Link

https://hdl.handle.net/10289/1073

Abstract

We show how to extract plain text from PostScript files. A textual scan is inadequate because PostScript interpreters can generate characters on the page that do not appear in the source file. Furthermore, word and line breaks are implicit in the graphical rendition, and must be inferred from the positioning of word fragments. We present a robust technique for extracting text and recognizing words and paragraphs. The method uses a standard PostScript interpreter but redefines several PostScript operators, and simple heuristics are employed to locate word and line breaks. The scheme has been used to create a full-text index, and plain-text versions, of 40,000 technical reports (34 Gbyte of PostScript). Other text-extraction systems are reviewed: none offer the same combination of robustness and simplicity.
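The break-finding heuristics mentioned in the abstract can be illustrated with a minimal sketch. It assumes the interpreter has already reported each rendered text fragment with its position and width. The `Fragment` type, the `join_fragments` function, and the `word_gap` and `line_tol` thresholds are hypothetical names and values chosen for illustration; the abstract does not give the paper's actual rules or constants.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """One string as painted on the page (hypothetical record layout)."""
    x: float      # left edge, in points
    y: float      # baseline, in points
    width: float  # advance width, in points
    text: str


def join_fragments(fragments: List[Fragment],
                   word_gap: float = 1.5,
                   line_tol: float = 2.0) -> List[str]:
    """Rebuild lines of text from fragments given in rendering order.

    Assumed heuristics (not taken from the paper):
    - a baseline shift larger than line_tol starts a new line;
    - a horizontal gap larger than word_gap starts a new word;
    - otherwise the fragment continues the current word.
    """
    lines: List[str] = []
    current = ""
    prev = None
    for frag in fragments:
        if prev is None:
            current = frag.text
        elif abs(frag.y - prev.y) > line_tol:
            # Baseline moved: close the current line.
            lines.append(current)
            current = frag.text
        elif frag.x - (prev.x + prev.width) > word_gap:
            # Visible gap on the same baseline: word break.
            current += " " + frag.text
        else:
            # Fragments abut: same word, split only in the source.
            current += frag.text
        prev = frag
    if prev is not None:
        lines.append(current)
    return lines


sample = [
    Fragment(72, 700, 20, "Extr"),
    Fragment(92, 700, 25, "acting"),
    Fragment(121, 700, 18, "text"),
    Fragment(72, 688, 22, "from"),
]
result = join_fragments(sample)
```

Here the first two fragments abut and are merged into one word, the third follows a 4-point gap and becomes a new word, and the fourth sits on a lower baseline and starts a new line. Fixed thresholds are the simplest possible choice; a practical system would more likely scale them with the current font size.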

Citation

Nevill-Manning, C. G., Reed, T., & Witten, I. H. (1997). Extracting text from PostScript. (Working paper 97/10). Hamilton, New Zealand: University of Waikato, Department of Computer Science.

Type

Working Paper

Series name

Computer Science Working Papers

Date

1997-04

Publisher

Department of Computer Science, University of Waikato