MARA: A Multimodal Adaptive Retrieval-Augmented Framework for Document Question Answering

Hui Wu; Haoquan Zhai; Yuchen Li; Hengyi Cai; Peirong Zhang; Yidan Zhang; Lei Wang; Chunle Wang; Yingyan Hou; Shuaiqiang Wang; Dawei Yin

arXiv:2604.16313 · cs.IR · April 21, 2026

TL;DR

MARA is a framework for multimodal document question answering that applies query-adaptive mechanisms to both retrieval and generation, with the goal of improving retrieval relevance and answer quality.

Contribution

MARA introduces query-adaptive retrieval and generation components that address two limitations of current multimodal QA approaches: query-agnostic document representations and static top-k evidence selection.

Findings

01. MARA outperforms existing state-of-the-art methods on six benchmarks.

02. It improves both retrieval relevance and answer accuracy.

03. The adaptive mechanisms improve the handling of complex multimodal documents.

Abstract

Retrieval-based multimodal document QA aims to identify and integrate relevant information from visually rich documents with complex multimodal structures. While retrieval-augmented generation (RAG) has shown strong performance in text-based QA, its extensions to multimodal documents remain underexplored and face significant limitations. Specifically, current approaches rely on query-agnostic document representations that overlook salient content and use static top-k evidence selection, which fails to adapt to the uncertain distribution of relevant information. To address these limitations, we propose the Multimodal Adaptive Retrieval-Augmented (MARA) framework, which introduces query-adaptive mechanisms to both retrieval and generation. MARA consists of two components: a Query-Aligned Region Encoder that builds multi-level document representations and reweights them based on query…
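
To make the abstract's contrast between static top-k selection and query-adaptive selection concrete, the sketch below compares a fixed-k cutoff with a cutoff that depends on how relevance scores are distributed. It is an illustration only and not MARA's actual mechanism, which this page does not specify. The function names, the softmax weighting, and the `mass` and `temperature` parameters are all assumptions made for the example.

```python
import math


def static_top_k(scores, k):
    """Return the indices of the k highest-scoring regions, whatever the score shape."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def adaptive_select(scores, mass=0.9, temperature=1.0):
    """Hypothetical adaptive rule (an assumption, not the paper's method):
    keep the smallest set of top-ranked regions whose softmax weight reaches `mass`."""
    if not scores:
        return []
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    order = sorted(range(len(scores)), key=lambda i: weights[i], reverse=True)
    chosen, covered = [], 0.0
    for i in order:
        chosen.append(i)
        covered += weights[i]
        if covered >= mass:
            break
    return chosen


# Made-up relevance scores for four document regions under two different queries.
peaked = [9.0, 1.0, 0.5, 0.2]  # one region clearly answers the query
flat = [1.0, 1.0, 1.0, 1.0]    # evidence is spread evenly across regions

print(static_top_k(peaked, 2), adaptive_select(peaked))
print(static_top_k(flat, 2), adaptive_select(flat))
```

With a fixed k of 2, both queries receive two regions. The adaptive rule keeps a single region when one score dominates and all four when the scores are flat, which is the kind of behavior a static cutoff cannot provide.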
