BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation

Qi Yang; Xiangyao Ma; Xiao Wang; Hao Wang; Rui Wang

arXiv:2605.10845 · cs.CV · May 12, 2026

TL;DR

BabelDOC introduces an IR-based framework for translating PDFs that maintains layout fidelity, enhances visual aesthetics, and ensures terminology consistency, addressing the challenge of cross-lingual document translation.

Contribution

BabelDOC presents a layout-preserving PDF translation method built on an intermediate representation, improving on existing approaches in layout fidelity and visual aesthetics.

Findings

01 BabelDOC outperforms baselines in layout fidelity and terminology consistency.

02 Human and multimodal LLM evaluations favor BabelDOC's translation quality.

03 The open-source toolkit has gained significant community engagement, with over 8.4K GitHub stars.

Abstract

As global cross-lingual communication intensifies, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation pipelines face a tension between linguistic processing and layout preservation: text-oriented Computer-Assisted Translation (CAT) systems often discard structural metadata, while document parsers focus on extraction and do not support faithful re-rendering after translation. We introduce BabelDOC, an Intermediate Representation (IR)-based framework for layout-preserving PDF translation. BabelDOC decouples visual layout metadata from semantic content, enabling document-level translation operations such as terminology extraction, cross-page context handling, glossary-constrained generation, and formula placeholdering. The translated content is then re-anchored to the original layout through an adaptive typesetting engine.…
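
The decoupling described in the abstract can be illustrated with a minimal sketch. It is not BabelDOC's actual API: the `Block` type, the `<f0>`-style placeholder syntax, the `$...$` formula delimiter, and the `translate_with_glossary` stand-in (which only substitutes glossary terms instead of calling a translation model) are all assumptions made for illustration. The sketch shows layout metadata travelling unchanged alongside the text, formulas being masked before translation, and glossary terms being applied consistently.

```python
import re
from dataclasses import dataclass, field

# Hypothetical IR node: layout metadata is stored apart from the text to translate.
@dataclass
class Block:
    bbox: tuple[float, float, float, float]  # layout: x0, y0, x1, y1 in page units
    text: str                                 # semantic content
    formulas: list[str] = field(default_factory=list)  # masked formulas, in order

# Assumption: inline formulas are delimited by $...$ in the extracted text.
FORMULA = re.compile(r"\$[^$]+\$")

def mask_formulas(block: Block) -> Block:
    """Replace each formula with a numbered placeholder such as <f0>."""
    found: list[str] = []

    def repl(match: re.Match) -> str:
        found.append(match.group(0))
        return f"<f{len(found) - 1}>"

    return Block(block.bbox, FORMULA.sub(repl, block.text), found)

def unmask_formulas(block: Block) -> Block:
    """Put the original formulas back in place of their placeholders."""
    text = block.text
    for i, formula in enumerate(block.formulas):
        text = text.replace(f"<f{i}>", formula)
    return Block(block.bbox, text, [])

def translate_with_glossary(text: str, glossary: dict[str, str]) -> str:
    """Stand-in for a real translation call: only substitutes glossary terms."""
    for source_term, target_term in glossary.items():
        text = text.replace(source_term, target_term)
    return text

def translate_block(block: Block, glossary: dict[str, str]) -> Block:
    """Mask formulas, translate the remaining text, then restore the formulas."""
    masked = mask_formulas(block)
    translated = translate_with_glossary(masked.text, glossary)
    # The bounding box is carried over untouched so the text can be re-anchored.
    return unmask_formulas(Block(masked.bbox, translated, masked.formulas))

if __name__ == "__main__":
    source = Block((72.0, 700.0, 540.0, 720.0), "The loss $L = x^2$ is minimized.")
    result = translate_block(source, {"loss": "Verlust"})
    print(result.text)
    print(result.bbox)
```

Because the formula is masked before the text reaches the translation step, a glossary entry can never rewrite part of an equation, and the bounding box on the returned block is the same one that was extracted from the source page.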
