Roles of MLLMs in Visually Rich Document Retrieval for RAG: A Survey
Xiantao Zhang

TL;DR
This survey reviews how Multimodal Large Language Models (MLLMs) enhance visually rich document retrieval in RAG, focusing on their roles, trade-offs, and future research directions.
Contribution
It categorizes MLLM roles in VRD retrieval, compares their characteristics, and offers practical guidance along with directions for future research.
Findings
MLLMs serve three roles in VRD retrieval: Modality-Unifying Captioners, Multimodal Embedders, and End-to-End Representers.
Trade-offs exist among retrieval granularity, information fidelity, latency, and index size.
Future directions include adaptive retrieval, model compression, and new evaluation methods.
Abstract
Visually rich documents (VRDs) challenge retrieval-augmented generation (RAG) with layout-dependent semantics, brittle OCR, and evidence spread across complex figures and structured tables. This survey examines how Multimodal Large Language Models (MLLMs) are being used to make VRD retrieval practical for RAG. We organize the literature into three roles: Modality-Unifying Captioners, Multimodal Embedders, and End-to-End Representers. We compare these roles along retrieval granularity, information fidelity, latency and index size, and compatibility with reranking and grounding. We also outline key trade-offs and offer practical guidance on when to favor each role. Finally, we identify promising directions for future research, including adaptive retrieval units, model size reduction, and the development of evaluation methods.
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Data Visualization and Analytics
