A Codebook Generation Algorithm for Document Image Compression

Qin Zhang; John Danskin; Neal Young

arXiv:cs/0205029·cs.DS·January 19, 2016


TL;DR

This paper introduces a new algorithm for generating codebooks in document image compression, improving pattern selection and achieving nearly 17% better compression than previous heuristics.

Contribution

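The paper extends the cross-entropy approach, previously used to measure pattern similarity, to the NP-hard pattern selection problem. This reduces codebook generation to a k-medians problem, for which the paper gives a new algorithm with a provable performance guarantee.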

Findings

01

Achieves an overall improvement in compression performance of almost 17%.

02

Provides a provably good algorithm for the k-medians problem.

03

Generates a better codebook than the First Fit heuristic, with or without generalized Lloyd's/k-means postprocessing.

Abstract

Pattern-matching-based document-compression systems (e.g. for faxing) rely on finding a small set of patterns that can be used to represent all of the ink in the document. Finding an optimal set of patterns is NP-hard; previous compression schemes have resorted to heuristics. This paper describes an extension of the cross-entropy approach, used previously for measuring pattern similarity, to this problem. This approach reduces the problem to a k-medians problem, for which the paper gives a new algorithm with a provably good performance guarantee. In comparison to previous heuristics (First Fit, with and without generalized Lloyd's/k-means postprocessing steps), the new algorithm generates a better codebook, resulting in an overall improvement in compression performance of almost 17%.
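
To make the k-medians formulation concrete, here is a minimal Python sketch. It is not the paper's algorithm: it uses a plain greedy selection with no performance guarantee, and it substitutes Hamming distance between small bitmaps for the cross-entropy-based cost the abstract refers to. The names `hamming`, `assignment_cost`, and `greedy_k_medians` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Pattern = Sequence[int]  # a flattened binary bitmap, e.g. (0, 1, 1, 0)


def hamming(a: Pattern, b: Pattern) -> int:
    """Stand-in dissimilarity (assumption): number of differing pixels."""
    return sum(x != y for x, y in zip(a, b))


def assignment_cost(
    patterns: List[Pattern],
    centers: List[int],
    cost: Callable[[Pattern, Pattern], int],
) -> int:
    """k-medians objective: each pattern pays the cost to its cheapest center."""
    return sum(min(cost(p, patterns[c]) for c in centers) for p in patterns)


def greedy_k_medians(
    patterns: List[Pattern],
    k: int,
    cost: Callable[[Pattern, Pattern], int] = hamming,
) -> Tuple[List[int], int]:
    """Pick up to k patterns as codebook entries, one at a time.

    Each round adds the pattern whose inclusion lowers the total
    assignment cost the most. Simple illustrative heuristic only.
    """
    if k < 1 or not patterns:
        raise ValueError("need at least one pattern and k >= 1")
    centers: List[int] = []
    for _ in range(min(k, len(patterns))):
        candidates = (i for i in range(len(patterns)) if i not in centers)
        best = min(
            candidates,
            key=lambda i: assignment_cost(patterns, centers + [i], cost),
        )
        centers.append(best)
    return centers, assignment_cost(patterns, centers, cost)


if __name__ == "__main__":
    marks = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0)]
    chosen, total = greedy_k_medians(marks, k=2)
    print(chosen, total)
```

In this toy setting the chosen indices play the role of the codebook, and the total cost stands in for the bits needed to encode every mark relative to its nearest codebook pattern. A real system would replace `hamming` with a coding-cost estimate.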
