Extracting text from PostScript

dc.contributor.author: Nevill-Manning, Craig G.
dc.contributor.author: Reed, Todd
dc.contributor.author: Witten, Ian H.
dc.date.accessioned: 2008-10-20T03:33:56Z
dc.date.available: 2008-10-20T03:33:56Z
dc.date.issued: 1997-04
dc.description.abstract: We show how to extract plain text from PostScript files. A textual scan is inadequate because PostScript interpreters can generate characters on the page that do not appear in the source file. Furthermore, word and line breaks are implicit in the graphical rendition, and must be inferred from the positioning of word fragments. We present a robust technique for extracting text and recognizing words and paragraphs. The method uses a standard PostScript interpreter but redefines several PostScript operators, and simple heuristics are employed to locate word and line breaks. The scheme has been used to create a full-text index, and plain-text versions, of 40,000 technical reports (34 Gbyte of PostScript). Other text-extraction systems are reviewed: none offer the same combination of robustness and simplicity. [en_US]
dc.format.mimetype: application/pdf
dc.identifier.citation: Nevill-Manning, C. G., Reed, T. & Witten, I.H. (1997). Extracting text from PostScript. (Working paper 97/10). Hamilton, New Zealand: University of Waikato, Department of Computer Science. [en_US]
dc.identifier.issn: 1170-487X
dc.identifier.uri: https://hdl.handle.net/10289/1073
dc.language.iso: en
dc.publisher: Department of Computer Science, University of Waikato [en_NZ]
dc.relation.ispartofseries: Computer Science Working Papers
dc.subject: computer science [en_US]
dc.subject: Machine learning
dc.title: Extracting text from PostScript [en_US]
dc.type: Working Paper [en_US]
dspace.entity.type: Publication
pubs.place-of-publication: Hamilton [en_NZ]
uow.relation.series: 97/10

Files

Original bundle

Name: uow-cs-wp-1997-10.pdf
Size: 204.95 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.8 KB
Format: Item-specific license agreed upon to submission