DocRevive: A Unified Pipeline for Document Text Restoration

Kunal Purkayastha; Ayan Banerjee; Josep Llados; Umapada Pal

arXiv:2604.10077·cs.CV·April 14, 2026

TL;DR

This paper introduces a comprehensive pipeline for restoring damaged or incomplete document text using OCR, image analysis, language modeling, and diffusion techniques, supported by a synthetic dataset and a new similarity metric.

Contribution

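It presents a unified document restoration pipeline that combines OCR, occlusion detection, masked language modeling, and diffusion-based text reintegration, and introduces a synthetic dataset of 30,078 degraded document images together with a new evaluation metric, UCSM, for restoration quality.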

Findings

01. Created a synthetic dataset of 30,078 degraded document images.

02. Developed a pipeline that detects, recognizes, and reconstructs text with semantic coherence.

03. Proposed UCSM, a new metric for evaluating restoration quality.

Abstract

In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font,…
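The text-side stages named in the abstract (recognize words, flag occluded ones, then fill them in with a masked language model) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Word`, `build_masked_sequence`, `restore`, and `toy_fill_mask` are hypothetical names, the `[MASK]` token follows a common masked-language-model convention rather than anything stated on this page, and the lookup table is only a stand-in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

MASK = "[MASK]"  # assumed mask token, borrowed from common masked-LM conventions


@dataclass
class Word:
    text: str       # OCR output for this word (unreliable if the word is occluded)
    occluded: bool  # flag that an occlusion detector would set (hypothetical)


def build_masked_sequence(words: Sequence[Word]) -> List[str]:
    """Replace every word flagged as occluded with the mask token."""
    return [MASK if w.occluded else w.text for w in words]


def restore(words: Sequence[Word],
            fill_mask: Callable[[List[str], int], str]) -> List[str]:
    """Fill masks left to right so earlier predictions become context for later ones."""
    tokens = build_masked_sequence(words)
    for i, tok in enumerate(tokens):
        if tok == MASK:
            tokens[i] = fill_mask(tokens, i)
    return tokens


def toy_fill_mask(tokens: List[str], index: int) -> str:
    """Stand-in for a masked language model: predicts from the previous word only."""
    table = {"the": "document", "damaged": "text"}
    prev = tokens[index - 1].lower() if index > 0 else ""
    return table.get(prev, "<unk>")


words = [
    Word("Restore", False),
    Word("the", False),
    Word("d#c~", True),   # garbled OCR output under an occlusion
    Word("now", False),
]
restored = restore(words, toy_fill_mask)
print(" ".join(restored))
```

In a real system, `fill_mask` would wrap an actual masked language model, and the restored words would then be handed to the visual reintegration stage, which this sketch leaves out.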

Code & Models

Repositories

kunalpurkayastha/DocRevive (GitHub)
