dc.contributor.advisor: Witten, Ian H.
dc.contributor.author: Inglis, Stuart J.
dc.date.accessioned: 2022-06-10T03:36:29Z
dc.date.available: 2022-06-10T03:36:29Z
dc.date.issued: 1999
dc.identifier.uri: https://hdl.handle.net/10289/14909
dc.description.abstract: Document image compression reduces the storage requirements for digitised books or documents by using characters as the fundamental unit of compression. Compression gains can be achieved by identifying regions that contain text, isolating unique characters, and storing them in a codebook. This thesis investigates several fundamental areas of the compression process. Algorithms for each area are tested on a corpus of images and the improvements tested for statistical significance. Methods for isolating characters from a bitmap are investigated along with techniques for determining reading order. We introduce the use of the docstrum to aid image compression and show that it improves upon previous methods. The Hough transform is shown to be an accurate method for determining page skew and gives robust results over a range of image resolutions. Compression is shown to improve when the skew of an image is determined automatically, and used to determine reading order. If images can be segmented into regions of text and non-text, it is possible to change the compression model to reflect the region’s contents. Instead of developing ad hoc methods to classify the regions, we introduce the use of machine learning schemes, and show both high predictive accuracy and compression improvements. Machine learning schemes are also applied to the problem of matching characters to determine similarity. Using the scheme’s choice of parameters, the matching methods and their respective compression gains can be compared fairly. We investigate the changes in codebook size and compression gains when compressing multiple pages. The compression system presented is complete, free of arbitrarily introduced parameters and achieves the best known lossless compression on a public corpus of images.
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.publisher: The University of Waikato
dc.rights: All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.
dc.title: Lossless document image compression
dc.type: Thesis
thesis.degree.grantor: The University of Waikato
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy (PhD)
dc.date.updated: 2022-06-10T03:30:36Z
pubs.place-of-publication: Hamilton, New Zealand (en_NZ)

