A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

Yihao Ding; Siwen Luo; Yue Dai; Yanbei Jiang; Zechuan Li; Qiang Sun; Geoffrey Martin; Wei Liu; Yifan Peng

arXiv:2507.09861·cs.CV·April 22, 2026

TL;DR

This survey reviews recent advances in Multimodal Large Language Models for Visually Rich Document Understanding, focusing on techniques, challenges, and future research directions.

Contribution

The survey provides a comprehensive overview of MLLM-based VRDU methods, highlighting emerging trends and identifying key challenges in the field.

Findings

1. MLLMs show promise in OCR-based and OCR-free document understanding.

2. Techniques for integrating textual, visual, and layout features are evolving.

3. Challenges include data scarcity and handling multi-page, multilingual documents.

Abstract

Visually Rich Document Understanding (VRDU) has become a pivotal area of research, driven by the need to automatically interpret documents that contain intricate visual, textual, and structural elements. Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant promise in this domain, including both OCR-based and OCR-free approaches for information extraction from document images. This survey reviews recent advances in MLLM-based VRDU, highlighting emerging trends and promising research directions with a focus on two key aspects: (1) techniques for representing and integrating textual, visual, and layout features; (2) training paradigms, including pretraining, instruction tuning, and training strategies. Moreover, we address challenges such as data scarcity, handling multi-page and multilingual documents, and integrating emerging trends such as Retrieval-Augmented…
