MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
Ketong Chen, Yuhao Chen, Yang Xue

TL;DR
MosaicDoc is a large-scale, bilingual benchmark for Visually Rich Document Understanding, created using an innovative multi-agent pipeline to evaluate and advance vision-language models on complex, real-world documents.
Contribution
We introduce DocWeaver, a novel multi-agent pipeline leveraging LLMs to automatically generate MosaicDoc, a comprehensive VRDU benchmark with diverse layouts and annotations in Chinese and English.
Findings
Current models struggle with complex document layouts.
MosaicDoc reveals limitations of existing VRDU models.
The benchmark facilitates future research in real-world document understanding.
Abstract
Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading…
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Topic Modeling
