TL;DR
This paper introduces a comprehensive pipeline for restoring damaged or incomplete document text using OCR, image analysis, language modeling, and diffusion techniques, supported by a synthetic dataset and a new similarity metric.
Contribution
It presents a novel unified document restoration pipeline combining multiple advanced models and introduces a synthetic dataset and a new evaluation metric for restoration quality.
Findings
Created a synthetic dataset of 30,078 degraded document images.
Developed a pipeline that detects, recognizes, and reconstructs text with semantic coherence.
Proposed UCSM, a new metric for evaluating restoration quality.
Abstract
In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font,…
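The stage ordering named in the abstract (detect and recognize text, flag occluded regions, fill in the missing text, then render it back into the image) can be illustrated with a minimal sketch. This is not the authors' implementation: `Region`, `restore_document`, and every `toy_*` function are hypothetical names introduced here, and the toy stages are trivial stand-ins for the OCR model, occlusion detector, masked language model, and diffusion module, none of which the summary specifies.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); an assumed layout


@dataclass(frozen=True)
class Region:
    """One detected text region (hypothetical structure)."""
    box: Box
    text: str
    occluded: bool = False


def restore_document(
    image: Any,
    detect_and_recognize: Callable[[Any], List[Region]],
    is_occluded: Callable[[Any, Region], bool],
    inpaint_text: Callable[[List[Region], int], str],
    render: Callable[[Any, Region], Any],
) -> Tuple[Any, List[Region]]:
    """Chain the four stages in the order the abstract lists them."""
    # 1. Text detection and recognition (OCR).
    regions = detect_and_recognize(image)
    # 2. Occlusion detection: mark regions whose text is damaged.
    flagged = [replace(r, occluded=is_occluded(image, r)) for r in regions]
    # 3. Text inpainting: predict text for damaged regions from context.
    restored = [
        replace(r, text=inpaint_text(flagged, i)) if r.occluded else r
        for i, r in enumerate(flagged)
    ]
    # 4. Visual reintegration: draw only the repaired regions back in.
    for r in restored:
        if r.occluded:
            image = render(image, r)
    return image, restored


# Toy stand-ins so the sketch runs end to end; real models would replace them.
def toy_ocr(image: Any) -> List[Region]:
    return [
        Region((0, 0, 10, 5), "The quick"),
        Region((10, 0, 10, 5), "br##n"),
        Region((20, 0, 10, 5), "fox"),
    ]


def toy_is_occluded(image: Any, region: Region) -> bool:
    return "#" in region.text  # '#' marks unreadable characters in this toy


def toy_inpaint(regions: List[Region], index: int) -> str:
    return "brown"  # placeholder for a masked-language-model prediction


def toy_render(image: List[Tuple[Box, str]], region: Region) -> List[Tuple[Box, str]]:
    return image + [(region.box, region.text)]  # the "image" is a list of patches


final_image, final_regions = restore_document(
    [], toy_ocr, toy_is_occluded, toy_inpaint, toy_render
)
print([r.text for r in final_regions])
```

Passing the stages in as callables keeps the orchestration independent of any particular model, so each stand-in can be swapped for a real component without changing the control flow.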
