Revise: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy

Gyuho Shim; Seongtae Hong; Heuiseok Lim

arXiv:2604.08115 · cs.AI · April 10, 2026

TL;DR

Revise is a framework that systematically corrects OCR errors at multiple levels using synthetic data, improving document understanding and management in Document AI applications.

Contribution

The paper introduces a hierarchical taxonomy of OCR errors and a synthetic data generation strategy for training correction models that support structured document understanding.

Findings

1. Revise significantly improves OCR correction accuracy.
2. Enhanced OCR correction leads to better document retrieval and question answering performance.
3. The framework effectively manages structural errors in OCR outputs.

Abstract

Recent advances in Large Language Models (LLMs) have significantly improved the field of Document AI, demonstrating remarkable performance on document understanding tasks such as question answering. However, existing approaches primarily focus on solving specific tasks, lacking the capability to structurally organize and manage document information. To address this limitation, we propose Revise, a framework that systematically corrects errors introduced by OCR at the character, word, and structural levels. Specifically, Revise employs a comprehensive hierarchical taxonomy of common OCR errors and a synthetic data generation strategy that realistically simulates such errors to train an effective correction model. Experimental results demonstrate that Revise effectively corrects OCR outputs, enabling more structured representation and systematic management of document contents.…
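To make the data contamination idea concrete, the following is a minimal sketch of how clean text could be corrupted at the three levels the abstract names (character, word, structural) to build training pairs for a correction model. It is not the authors' implementation: the confusion table, the three corruption rules, the rate parameters, and the function names (`corrupt_chars`, `corrupt_words`, `corrupt_structure`, `contaminate`, `make_pair`) are all hypothetical choices made for illustration, and the paper's actual taxonomy is likely richer.

```python
import random

# Hypothetical confusion table: visually similar characters an OCR engine
# might swap. The paper's real taxonomy is not given in this summary.
CHAR_CONFUSIONS = {"l": "1", "O": "0", "S": "5", "B": "8", "e": "c"}


def corrupt_chars(word, rng, rate):
    """Character level: replace confusable characters with probability `rate`."""
    return "".join(
        CHAR_CONFUSIONS[ch] if ch in CHAR_CONFUSIONS and rng.random() < rate else ch
        for ch in word
    )


def corrupt_words(words, rng, rate):
    """Word level: drop the space before a word with probability `rate`."""
    out = []
    for word in words:
        if out and rng.random() < rate:
            out[-1] += word  # merge with the previous word
        else:
            out.append(word)
    return out


def corrupt_structure(lines, rng, rate):
    """Structural level: swap adjacent lines with probability `rate`."""
    lines = list(lines)
    for i in range(len(lines) - 1):
        if rng.random() < rate:
            lines[i], lines[i + 1] = lines[i + 1], lines[i]
    return lines


def contaminate(text, seed=0, char_rate=0.1, word_rate=0.05, line_rate=0.05):
    """Apply all three assumed corruption levels to clean text."""
    rng = random.Random(seed)  # seeded so the same input gives the same noise
    noisy_lines = []
    for line in text.split("\n"):
        words = [corrupt_chars(w, rng, char_rate) for w in line.split(" ")]
        noisy_lines.append(" ".join(corrupt_words(words, rng, word_rate)))
    return "\n".join(corrupt_structure(noisy_lines, rng, line_rate))


def make_pair(text, **kwargs):
    """Build one (noisy input, clean target) training example."""
    return {"input": contaminate(text, **kwargs), "target": text}
```

Calling `make_pair(clean_text, seed=42)` on each clean document would yield supervised pairs in which the model sees the contaminated text and learns to recover the original. Keeping the three levels as separate functions is a design choice of this sketch; it lets each error rate be tuned independently to match the error profile of a given OCR engine.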
