Lossless document image compression

Inglis, Stuart J.

Abstract

Document image compression reduces the storage requirements for digitised books or documents by using characters as the fundamental unit of compression. Compression gains can be achieved by identifying regions that contain text, isolating unique characters, and storing them in a codebook. This thesis investigates several fundamental areas of the compression process. Algorithms for each area are tested on a corpus of images, and the improvements are tested for statistical significance. Methods for isolating characters from a bitmap are investigated, along with techniques for determining reading order. We introduce the use of the docstrum to aid image compression and show that it improves upon previous methods. The Hough transform is shown to be an accurate method for determining page skew, giving robust results over a range of image resolutions. Compression is shown to improve when the skew of an image is determined automatically and used to determine reading order. If images can be segmented into regions of text and non-text, it is possible to change the compression model to reflect each region’s contents. Instead of developing ad hoc methods to classify the regions, we introduce the use of machine learning schemes, and show both high predictive accuracy and compression improvements. Machine learning schemes are also applied to the problem of matching characters to determine similarity. Using the scheme’s choice of parameters, the matching methods and their respective compression gains can be compared fairly. We investigate the changes in codebook size and compression gains when compressing multiple pages. The compression system presented is complete, free of arbitrarily introduced parameters, and achieves the best known lossless compression on a public corpus of images.
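The codebook idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes a bilevel page stored as a list of rows of 0/1 values, treats each 8-connected component of black pixels as one "character", and adds a pattern to the codebook only when no exactly identical pattern is already there. The thesis itself studies learned matching methods, so exact matching is a simplification. The function names `extract_marks`, `build_codebook` and `reconstruct` are hypothetical.

```python
from collections import deque


def extract_marks(bitmap):
    """Return (top, left, pattern) for each 8-connected component of 1-pixels."""
    height, width = len(bitmap), len(bitmap[0])
    seen = [[False] * width for _ in range(height)]
    marks = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill collects one connected component.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            top = min(p[0] for p in pixels)
            left = min(p[1] for p in pixels)
            bottom = max(p[0] for p in pixels)
            right = max(p[1] for p in pixels)
            member = set(pixels)
            # Crop the component to its bounding box so that identical shapes
            # at different page positions yield identical patterns.
            pattern = tuple(
                tuple(1 if (r, c) in member else 0
                      for c in range(left, right + 1))
                for r in range(top, bottom + 1)
            )
            marks.append((top, left, pattern))
    return marks


def build_codebook(marks):
    """Store each distinct pattern once; record (index, top, left) per mark."""
    codebook, index_of, placements = [], {}, []
    for top, left, pattern in marks:
        if pattern not in index_of:  # exact match only (an assumption)
            index_of[pattern] = len(codebook)
            codebook.append(pattern)
        placements.append((index_of[pattern], top, left))
    return codebook, placements


def reconstruct(codebook, placements, height, width):
    """Rebuild the page by stamping codebook patterns at their positions."""
    page = [[0] * width for _ in range(height)]
    for index, top, left in placements:
        for r, row in enumerate(codebook[index]):
            for c, bit in enumerate(row):
                if bit:
                    page[top + r][left + c] = 1
    return page
```

Because every black pixel belongs to exactly one component and each pattern holds only its own component's pixels, stamping the patterns back at their recorded positions reproduces the input exactly. That is the lossless property in this toy setting. A real system would also entropy-code the codebook and the placements, which this sketch leaves out.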

Type

Thesis

Date

1999

Publisher

The University of Waikato

Degree

Doctor of Philosophy (PhD)

Supervisors

Witten, Ian H.
