RegionRAG: Region-level Retrieval-Augmented Generation for Visual Document Understanding

Yinglu Li; Zhiying Lu; Zhihang Liu; Yiwei Sun; Chuanbin Liu; Hongtao Xie

arXiv:2510.27261·cs.CV·December 23, 2025


TL;DR

RegionRAG retrieves the relevant regions of visual documents rather than entire documents, improving R@1 retrieval accuracy by 10.02% on average and question answering accuracy by 3.56%.

Contribution

The paper proposes a region-level retrieval framework that combines hybrid supervision with dynamic grouping, improving both efficiency and accuracy over document-level methods.

Findings

01. Achieves 10.02% higher R@1 retrieval accuracy on average.

02. Boosts question answering accuracy by 3.56%.

03. Uses only 71.42% of the visual tokens required by prior methods.
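
For readers unfamiliar with the R@1 metric cited in the findings, the snippet below is a minimal sketch of how Recall@K is commonly computed for retrieval. It assumes one relevant item per query, in which case the "hit rate" form shown here coincides with other common definitions. The function name `recall_at_k` and the toy document IDs are illustrative and are not taken from the paper or its code.

```python
def recall_at_k(ranked_ids, relevant_ids, k=1):
    """Fraction of queries whose top-k retrieved items contain a relevant item.

    ranked_ids:   list of ranked result lists, one per query (best first).
    relevant_ids: list of sets of relevant item IDs, one per query.
    """
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        # A query counts as a hit if any of its top-k results is relevant.
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Toy data (hypothetical): four queries, two retrieved items each.
ranked = [["d3", "d1"], ["d2", "d5"], ["d4", "d9"], ["d7", "d8"]]
relevant = [{"d3"}, {"d5"}, {"d4"}, {"d1"}]

r_at_1 = recall_at_k(ranked, relevant, k=1)
r_at_2 = recall_at_k(ranked, relevant, k=2)
print(f"R@1 = {r_at_1:.2f}, R@2 = {r_at_2:.2f}")
```

R@1 is the strictest setting: only the single top-ranked result is checked for each query.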

Abstract

Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods treat the entire document as the basic retrieval unit, which introduces substantial irrelevant visual content in two ways: 1) relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) retrieving multiple documents to increase recall brings in further redundant and irrelevant documents. This redundant context distracts the model's attention and degrades performance. To address this challenge, we propose RegionRAG, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy that draws on both labeled and unlabeled data to pinpoint relevant patches. During inference, we…
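
To make the idea of region-level retrieval concrete, here is a minimal sketch of one generic way to turn per-patch relevance scores into regions: threshold the scores, then merge 4-connected patches into bounding boxes. This is an assumption made for illustration only. The abstract above is truncated before it describes inference, so this is not the paper's procedure; the function `group_patches`, the threshold, and the toy score grid are all hypothetical.

```python
from collections import deque


def group_patches(scores, threshold):
    """Group 4-connected patches scoring >= threshold into bounding boxes.

    scores: 2D list of per-patch relevance scores (rows x cols).
    Returns boxes as (top, left, bottom, right), inclusive, in patch units.
    """
    rows, cols = len(scores), len(scores[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or scores[r][c] < threshold:
                continue
            # Breadth-first search over neighbouring relevant patches.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and scores[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical 3x4 grid of query-to-patch similarity scores.
scores = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.6],
]
regions = group_patches(scores, threshold=0.5)

# Patches covered by the boxes, versus passing the whole page to the model.
kept = sum((b - t + 1) * (r - l + 1) for t, l, b, r in regions)
token_fraction = kept / (len(scores) * len(scores[0]))
print(regions, f"{token_fraction:.2f}")
```

In this toy grid the two boxes cover 5 of 12 patches, which illustrates why passing regions rather than whole pages can reduce the number of visual tokens given to the generator.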


Code & Models

Models

🤗 Aeryn666/RegionRet (model · 168 downloads · ♡ 1)


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection