Abstract
Document image compression reduces the storage requirements for digitised books or documents by using characters as the fundamental unit of compression. Compression gains can be achieved by identifying regions that contain text, isolating unique characters, and storing them in a codebook.
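The codebook idea can be illustrated with a minimal sketch. It assumes the characters have already been isolated as small bitmaps, and it treats two characters as the same only when their bitmaps are identical. That exact-match rule is a simplification chosen for illustration. The names `Glyph`, `encode`, and `decode` are hypothetical and do not come from the thesis.

```python
from typing import Dict, List, Tuple

# Hypothetical glyph representation: one string per bitmap row,
# '#' for ink and '.' for background.
Glyph = Tuple[str, ...]


def encode(glyphs: List[Glyph]) -> Tuple[List[Glyph], List[int]]:
    """Store each distinct glyph once and replace the page by indices."""
    codebook: List[Glyph] = []
    index_of: Dict[Glyph, int] = {}
    indices: List[int] = []
    for glyph in glyphs:
        if glyph not in index_of:
            # First occurrence: add the bitmap to the codebook.
            index_of[glyph] = len(codebook)
            codebook.append(glyph)
        indices.append(index_of[glyph])
    return codebook, indices


def decode(codebook: List[Glyph], indices: List[int]) -> List[Glyph]:
    """Rebuild the original glyph sequence from codebook and indices."""
    return [codebook[i] for i in indices]
```

A page with many repeated characters then costs one bitmap per unique character plus one small index per occurrence, which is where the compression gain comes from.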
This thesis investigates several fundamental areas of the compression process. Algorithms for each area are evaluated on a corpus of images, and the improvements are tested for statistical significance. Methods for isolating characters from a bitmap are investigated along with techniques for determining reading order. We introduce the use of the docstrum to aid image compression and show that it improves upon previous methods. The Hough transform is shown to be an accurate method for determining page skew and gives robust results over a range of image resolutions. Compression is shown to improve when the skew of an image is determined automatically and used to determine reading order.
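As a rough illustration of Hough-style skew estimation, the sketch below votes over a small set of candidate angles. It assumes the input is a list of (x, y) reference points, for example one point per character on its baseline, and it scores each angle by how sharply the points cluster into parallel lines. The function name `estimate_skew`, the scoring rule (sum of squared bin counts), and the one-pixel bin width are all assumptions made for illustration, not details taken from the thesis.

```python
import math
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def estimate_skew(points: List[Tuple[float, float]],
                  angles_deg: Iterable[float]) -> Optional[float]:
    """Return the candidate angle whose Hough accumulator is most peaked."""
    best_angle: Optional[float] = None
    best_score = -1
    for angle in angles_deg:
        theta = math.radians(angle)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # For a line with slope angle theta, y*cos(theta) - x*sin(theta)
        # is constant, so points on one text line vote for the same bin.
        votes = Counter(round(y * cos_t - x * sin_t) for x, y in points)
        # Sum of squared counts rewards a few tall peaks over many low ones.
        score = sum(v * v for v in votes.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

The angle is expressed in whatever coordinate system the points use; with image coordinates, where y grows downward, the sign of the result flips relative to the usual mathematical convention.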
If images can be segmented into regions of text and non-text, it is possible to change the compression model to reflect the region’s contents. Instead of developing ad hoc methods to classify the regions, we introduce the use of machine learning schemes, and show both high predictive accuracy and compression improvements. Machine learning schemes are also applied to the problem of matching characters to determine similarity. Using the scheme’s choice of parameters, the matching methods and their respective compression gains can be compared fairly. We investigate the changes in codebook size and compression gains when compressing multiple pages.
The compression system presented is complete, free of arbitrarily introduced parameters, and achieves the best known lossless compression on a public corpus of images.
Type
Thesis
Date
1999
Publisher
The University of Waikato
Rights
All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.