A Codebook Generation Algorithm for Document Image Compression
Qin Zhang, John Danskin, Neal Young

TL;DR
This paper introduces a new algorithm for generating codebooks in document image compression, improving pattern selection and achieving nearly 17% better compression than previous heuristics.
Contribution
It extends the cross-entropy approach, previously used to measure pattern similarity, to the NP-hard pattern selection problem. This reduces pattern selection to a k-medians problem, for which the paper gives a new algorithm with a provable performance guarantee.
Findings
Improves overall compression performance by almost 17% compared with previous heuristics.
Gives a new algorithm for the k-medians problem with a provably good performance guarantee.
Generates a better codebook than First Fit, with or without generalized Lloyd's/k-means postprocessing.
Abstract
Pattern-matching-based document-compression systems (e.g. for faxing) rely on finding a small set of patterns that can be used to represent all of the ink in the document. Finding an optimal set of patterns is NP-hard; previous compression schemes have resorted to heuristics. This paper describes an extension of the cross-entropy approach, used previously for measuring pattern similarity, to this problem. This approach reduces the problem to a k-medians problem, for which the paper gives a new algorithm with a provably good performance guarantee. In comparison to previous heuristics (First Fit, with and without generalized Lloyd's/k-means postprocessing steps), the new algorithm generates a better codebook, resulting in an overall improvement in compression performance of almost 17%.
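To make the reduction concrete, the following is a minimal sketch of the k-medians objective alongside a First Fit style baseline. It is not the paper's algorithm: `greedy_kmedians` is a plain forward-selection heuristic chosen for illustration, Hamming distance stands in for the cross-entropy cost, and all function names and the `threshold` parameter are assumptions. Patterns are modeled as tuples of 0/1 pixels.

```python
from typing import Callable, List, Sequence, Tuple

Pattern = Tuple[int, ...]
Dist = Callable[[Pattern, Pattern], float]


def hamming(a: Pattern, b: Pattern) -> int:
    """Number of differing pixels (a stand-in for a cross-entropy cost)."""
    return sum(x != y for x, y in zip(a, b))


def kmedians_cost(points: Sequence[Pattern], centers: Sequence[Pattern],
                  dist: Dist) -> float:
    """k-medians objective: each point pays its distance to the nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def first_fit(points: Sequence[Pattern], dist: Dist,
              threshold: float) -> List[Pattern]:
    """First Fit style baseline: a pattern joins the codebook only if no
    existing codebook entry matches it within the threshold."""
    codebook: List[Pattern] = []
    for p in points:
        if not any(dist(p, c) <= threshold for c in codebook):
            codebook.append(p)
    return codebook


def greedy_kmedians(points: Sequence[Pattern], k: int,
                    dist: Dist) -> List[Pattern]:
    """Illustrative forward selection: repeatedly add the candidate pattern
    that lowers the k-medians cost the most. Hypothetical helper, not the
    algorithm from the paper."""
    candidates = list(dict.fromkeys(points))  # distinct patterns, order kept
    centers: List[Pattern] = []
    for _ in range(min(k, len(candidates))):
        remaining = [c for c in candidates if c not in centers]
        best = min(remaining,
                   key=lambda c: kmedians_cost(points, centers + [c], dist))
        centers.append(best)
    return centers
```

First Fit fixes a matching threshold and lets the codebook size fall out of the scan order, whereas the k-medians view fixes the codebook size `k` and chooses patterns to minimize the total cost of representing every other pattern.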
