MLDocRAG: Multimodal Long-Context Document Retrieval Augmented Generation

Yongyue Zhang; Yaxiong Wu

arXiv:2602.10271 · cs.IR · February 16, 2026

TL;DR

MLDocRAG introduces a novel framework for multimodal long-context document retrieval and question answering, using a graph-based query-centric approach to improve evidence aggregation and answer accuracy across diverse modalities and pages.

Contribution

It proposes a Multimodal Chunk-Query Graph (MCQG) for organizing and retrieving multimodal content, advancing long-context understanding in multimodal QA tasks.

Findings

01. Improves retrieval quality and answer accuracy on benchmark datasets

02. Effectively aggregates evidence across modalities and pages

03. Enhances grounding and coherence in multimodal long-context QA

Abstract

Understanding multimodal long-context documents, which comprise multimodal chunks such as paragraphs, figures, and tables, is challenging for two reasons: (1) cross-modal heterogeneity, which makes it difficult to localize relevant information across modalities, and (2) cross-page reasoning, which requires aggregating evidence dispersed across pages. To address these challenges, we adopt a query-centric formulation that projects cross-modal and cross-page information into a unified query representation space, with queries acting as abstract semantic surrogates for heterogeneous multimodal content. In this paper, we propose a Multimodal Long-Context Document Retrieval Augmented Generation (MLDocRAG) framework that leverages a Multimodal Chunk-Query Graph (MCQG) to organize multimodal document content around semantically rich, answerable queries. MCQG is constructed via a multimodal document expansion process that generates…
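To make the query-centric idea in the abstract concrete, here is a minimal sketch of a chunk-query graph in Python. It is an illustration under stated assumptions, not the authors' implementation: the class and function names (`Chunk`, `ChunkQueryGraph`, `retrieve`), the hand-written queries, and the token-overlap (Jaccard) scoring are all hypothetical stand-ins. The abstract does not specify how queries are generated, how they are embedded, or how the graph is traversed. The sketch only shows how matching a question against stored queries, rather than against raw chunks, can surface evidence from different modalities and pages.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    modality: str  # e.g. "paragraph", "figure", "table"
    page: int
    content: str


def _tokens(text):
    # Crude tokenizer (assumption): lowercase, drop "?", split on whitespace.
    return set(text.lower().replace("?", "").split())


def _jaccard(a, b):
    # Token-overlap similarity; a stand-in for whatever matching the paper uses.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ChunkQueryGraph:
    """Toy bipartite graph linking answerable queries to the chunks they describe."""

    def __init__(self):
        self.chunks = {}
        self.query_to_chunks = defaultdict(set)

    def add_chunk(self, chunk, queries):
        # In this sketch the queries are supplied by hand; how they would be
        # generated from a chunk is left open.
        self.chunks[chunk.chunk_id] = chunk
        for query in queries:
            self.query_to_chunks[query].add(chunk.chunk_id)

    def retrieve(self, question, top_k=2):
        # Rank stored queries against the question, then follow edges to chunks.
        q_tokens = _tokens(question)
        scored = sorted(
            ((_jaccard(q_tokens, _tokens(q)), q) for q in self.query_to_chunks),
            reverse=True,
        )
        hits = []
        for score, query in scored[:top_k]:
            if score == 0:
                continue
            for chunk_id in sorted(self.query_to_chunks[query]):
                if chunk_id not in hits:
                    hits.append(chunk_id)
        return [self.chunks[chunk_id] for chunk_id in hits]


graph = ChunkQueryGraph()
graph.add_chunk(
    Chunk("p3-table", "table", 3, "Revenue by year: 2021 10M, 2022 14M"),
    ["what was the revenue in 2022"],
)
graph.add_chunk(
    Chunk("p7-figure", "figure", 7, "Bar chart of revenue growth by year"),
    ["how did revenue grow in 2022"],
)
graph.add_chunk(
    Chunk("p1-para", "paragraph", 1, "The company was founded in 1999."),
    ["when was the company founded"],
)

results = graph.retrieve("What was the revenue growth in 2022?")
for chunk in results:
    print(chunk.chunk_id, chunk.modality, chunk.page)
```

In this toy run the question retrieves a table on page 3 and a figure on page 7 through their associated queries, which is the cross-modal, cross-page aggregation the abstract describes. A real system would presumably replace the token overlap with learned embeddings and the hand-written queries with generated ones.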

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Information Retrieval and Search Behavior