TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

Chengye Wang; Lin Fu; Zexi Kuang; Yilun Zhao

arXiv:2604.22880·cs.CL·April 28, 2026


TL;DR

This paper introduces TexOCR, a 2B-parameter model, and TexOCR-Bench, a benchmark, for reconstructing scientific PDF pages into compilable LaTeX. The work emphasizes structural faithfulness and compilability, and reports that reinforcement learning improves results over supervised fine-tuning alone.

Contribution

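The paper presents four things: TexOCR, a 2B-parameter page-to-LaTeX model; TexOCR-Train, a large-scale training corpus; TexOCR-Bench, a benchmark whose evaluation suite jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability; and a reinforcement learning approach whose verifiable rewards come from LaTeX unit tests that enforce compilability and referential integrity.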

Findings

1. Existing OCR systems often violate document invariants, affecting compilation.

2. Reinforcement learning with verifiable rewards improves structural and compilation metrics.

3. TexOCR outperforms 21 frontier models on the TexOCR-Bench evaluation suite.

Abstract

Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including…
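To make the idea of a verifiable reward from LaTeX unit tests more concrete, here is a minimal sketch of two structural checks on a generated LaTeX string: referential integrity (every reference has a matching label) and balanced environments. This is an illustration only, not the paper's implementation. The function names, the regular expressions, and the equal-weight scoring are all assumptions, and the sketch does not run a LaTeX compiler, so it does not cover the compilability check the abstract describes. How the authors treat references to labels defined on other pages is not stated in this summary.

```python
import re

# Hypothetical patterns chosen for illustration; the paper's actual
# unit tests are not shown in this summary.
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\(?:ref|eqref|autoref|cref|Cref)\{([^}]*)\}")
ENV_RE = re.compile(r"\\(begin|end)\{([^}]*)\}")


def dangling_refs(tex):
    """Return the set of reference keys with no matching label in `tex`."""
    labels = set(LABEL_RE.findall(tex))
    refs = set()
    for group in REF_RE.findall(tex):
        # Commands such as \cref accept comma-separated keys.
        refs.update(key.strip() for key in group.split(","))
    return refs - labels


def environments_balanced(tex):
    """Return True if every \\begin{X} is closed by a matching \\end{X}."""
    stack = []
    for kind, name in ENV_RE.findall(tex):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


def structural_reward(tex):
    """Fraction of structural checks passed (assumed equal weighting)."""
    checks = [not dangling_refs(tex), environments_balanced(tex)]
    return sum(checks) / len(checks)


if __name__ == "__main__":
    page = r"\begin{equation}\label{eq:a} x = 1 \end{equation} See \eqref{eq:a}."
    print(structural_reward(page))
```

Because each check is a deterministic function of the output string, the score can be recomputed by anyone, which is the sense in which such a reward is verifiable. The sketch ignores commented-out LaTeX and verbatim blocks, so a production check would need a real parser or an actual compile step.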


Code & Models

Repositories

qdrhhhh/TexOCR (GitHub)
