URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding
Yongxin Shi, Jiapeng Wang, Zeyu Shan, Dezhi Peng, Zening Lin, Lianwen Jin

TL;DR
URaG introduces a unified framework that leverages the inherent evidence localization ability of multimodal LLMs to perform retrieval and generation simultaneously, significantly improving long document understanding efficiency and accuracy.
Contribution
The paper proposes URaG, a novel framework that unifies retrieval and generation in multimodal LLMs, explicitly leveraging their coarse-to-fine reasoning pattern for efficient long document processing.
Findings
Achieves state-of-the-art performance on long document tasks.
Reduces computational overhead by 44-56%.
Effectively localizes relevant evidence during reasoning.
Abstract
Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the…
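The coarse-to-fine observation can be illustrated with a small sketch. It assumes attention weights from question tokens to document tokens are available for a given layer, along with a page index for each document token. The function names, array shapes, and toy numbers below are hypothetical and are not taken from the paper; the visible abstract does not describe URaG's actual retrieval module, so this is only one plausible way to turn attention into page-level scores.

```python
import numpy as np

def page_attention_scores(attn, page_ids, num_pages):
    """Aggregate token-level attention into a normalized score per page.

    attn:     (num_query_tokens, num_doc_tokens) attention weights (assumed input).
    page_ids: (num_doc_tokens,) page index of each document token (assumed input).
    """
    attn = np.asarray(attn, dtype=float)
    page_ids = np.asarray(page_ids)
    # Average over query tokens: the attention each document token receives.
    token_mass = attn.mean(axis=0)
    # Sum the mass of all tokens that belong to the same page.
    scores = np.bincount(page_ids, weights=token_mass, minlength=num_pages)
    total = scores.sum()
    return scores / total if total > 0 else scores

def top_k_pages(scores, k):
    """Return the indices of the k highest-scoring pages, in page order."""
    order = np.argsort(-scores, kind="stable")[:k]
    return sorted(order.tolist())

# Toy document: 3 pages with 2 tokens each (made-up numbers).
page_ids = [0, 0, 1, 1, 2, 2]
# "Early layer": attention spread evenly across the whole document.
early = np.full((2, 6), 1.0 / 6.0)
# "Deep layer": attention concentrated on page 1.
deep = np.array([[0.05, 0.05, 0.40, 0.40, 0.05, 0.05],
                 [0.10, 0.10, 0.30, 0.30, 0.10, 0.10]])

early_scores = page_attention_scores(early, page_ids, num_pages=3)
deep_scores = page_attention_scores(deep, page_ids, num_pages=3)
selected = top_k_pages(deep_scores, k=1)
```

In this toy setup the early-layer scores are uniform across pages, while the deep-layer scores single out page 1. Keeping only the selected pages for later computation is the kind of pruning that would reduce cost under these assumptions.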
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
