DocAtlas: Multilingual Document Understanding Across 80+ Languages
Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan

TL;DR
DocAtlas is a multilingual document understanding framework covering 82 languages. It builds high-fidelity OCR datasets with rendering-based annotation rather than model-generated labels, and pairs them with a DPO-based training strategy to address low-resource language challenges.
Contribution
It presents a new dataset construction pipeline, a unified annotation format, and a training approach that enhances multilingual performance without degrading base-language accuracy.
Findings
Evaluating 16 state-of-the-art models reveals persistent performance gaps on low-resource scripts.
DPO with rendering-derived ground truth as the positive signal gives stable multilingual adaptation.
DocAtlas-DeepSeek outperforms the baseline by 1.7% in accuracy.
Abstract
Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%)…
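To make the DPO step concrete, here is a minimal sketch of the standard DPO objective applied to a preference pair whose preferred side is the rendering-derived ground truth. It is not the authors' implementation. The names `make_preference_pair` and `dpo_loss` and the value `beta=0.1` are hypothetical, and using a model's own output as the rejected side is an assumption, since the abstract only specifies the positive signal.

```python
import math


def make_preference_pair(prompt, rendered_ground_truth, model_output):
    """Build one preference pair (hypothetical helper).

    The rendering-derived ground truth is the preferred ("chosen") answer.
    ASSUMPTION: the rejected answer is a model's own prediction; the
    abstract does not say where negatives come from.
    """
    return {
        "prompt": prompt,
        "chosen": rendered_ground_truth,
        "rejected": model_output,
    }


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy and the reference model agree, the loss equals log 2. It falls as the policy assigns relatively more probability to the ground-truth output than the reference does.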
