Autonomous Cleaning of Corrupted Scanned Documents - A Generative Modeling Approach

Zhenwen Dai, Jörg Lücke

arXiv:1201.2605·cs.CV·October 21, 2014


TL;DR

This paper presents a generative modeling approach for autonomously cleaning heavily corrupted scanned documents by learning character representations without supervision and distinguishing dirt from regular patterns.

Contribution

It introduces a probabilistic generative model with a novel variational EM method to learn character features and remove irregular dirt patterns from scanned pages.

Findings

01  Effective cleaning of heavily corrupted pages with limited character examples

02  General applicability across different alphabets and types of corruption

03  Autonomous discrimination between character patterns and dirt

Abstract

We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes, spilled ink, etc. We aim at autonomously removing dirt from a single letter-size page based only on the information the page contains. Our approach, therefore, has to learn character representations without supervision and requires a mechanism to distinguish learned representations from irregular patterns. To learn character representations, we use a probabilistic generative model parameterizing pattern features, feature variances, the features' planar arrangements, and pattern frequencies. The latent variables of the model describe pattern class, pattern position, and the presence or absence of individual pattern features. The model parameters are optimized using a novel variational EM approximation. After learning, the parameters represent, independently of their…
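To make the latent structure named in the abstract more concrete, the following is a minimal sketch of ancestral sampling from a toy model with a pattern class, a pattern position, and a per-feature presence mask. It is not the authors' model or their variational EM procedure: the function name `sample_patch`, the use of one scalar feature per pixel, the uniform prior over positions, the Gaussian background, and all parameter names are assumptions made purely for illustration.

```python
import numpy as np


def sample_patch(rng, class_prior, templates, template_std, presence_prob,
                 patch_shape, background_mean=0.0, background_std=0.05):
    """Draw one patch from a toy class/position/mask generative model.

    Assumed (hypothetical) parameterization:
      class_prior   -- (C,) pattern frequencies, summing to 1
      templates     -- (C, h, w) mean feature value per template pixel
      template_std  -- (C, h, w) feature standard deviations
      presence_prob -- (C, h, w) probability that each feature is present
    """
    n_classes, th, tw = templates.shape
    ph, pw = patch_shape

    # Latent 1: pattern class, drawn from the pattern frequencies.
    c = rng.choice(n_classes, p=class_prior)

    # Latent 2: pattern position, uniform over placements that fit the patch.
    top = rng.integers(0, ph - th + 1)
    left = rng.integers(0, pw - tw + 1)

    # Latent 3: presence or absence of each individual pattern feature.
    mask = rng.random((th, tw)) < presence_prob[c]

    # Observed features: Gaussian background everywhere, overwritten by
    # Gaussian pattern features wherever a feature is marked as present.
    patch = rng.normal(background_mean, background_std, size=patch_shape)
    pattern = rng.normal(templates[c], template_std[c])
    region = patch[top:top + th, left:left + tw]  # view into patch
    region[mask] = pattern[mask]

    return patch, c, (top, left), mask
```

In a sketch like this, learning would mean estimating `class_prior`, `templates`, `template_std`, and `presence_prob` from unlabeled patches, which is the role the abstract assigns to the variational EM approximation; that step is not shown here.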
