MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding

Ketong Chen; Yuhao Chen; Yang Xue

arXiv:2511.09919 · cs.CV · November 14, 2025

TL;DR

MosaicDoc is a large-scale, bilingual benchmark for Visually Rich Document Understanding, created using an innovative multi-agent pipeline to evaluate and advance vision-language models on complex, real-world documents.

Contribution

We introduce DocWeaver, a novel multi-agent pipeline leveraging LLMs to automatically generate MosaicDoc, a comprehensive VRDU benchmark with diverse layouts and annotations in Chinese and English.

Findings

01. Current models struggle with complex document layouts.

02. MosaicDoc reveals limitations of existing VRDU models.

03. The benchmark facilitates future research in real-world document understanding.

Abstract

Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading…

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Topic Modeling