RegionRAG: Region-level Retrieval-Augmented Generation for Visual Document Understanding
Yinglu Li, Zhiying Lu, Zhihang Liu, Yiwei Sun, Chuanbin Liu, Hongtao Xie

TL;DR
RegionRAG retrieves the relevant regions of visual documents rather than entire documents, which improves both retrieval precision and answer accuracy.
Contribution
It proposes a novel region-level retrieval framework with hybrid supervision and dynamic grouping, improving efficiency and accuracy over document-level methods.
Findings
Achieves 10.02% higher R@1 retrieval accuracy on average.
Boosts question-answering accuracy by 3.56%.
Uses only 71.42% of the visual tokens required by prior methods.
Abstract
Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model's attention and further degrade the performance. To address this challenge, we propose RegionRAG, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we…
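The abstract is cut off before it describes the inference procedure, so the following is only an illustrative sketch of what retrieval at the region level can look like, not the authors' method. It assumes each document page is split into a grid of patches with precomputed embeddings, scores every patch against the query with cosine similarity, and merges adjacent high-scoring patches into rectangular regions. The names `score_patches` and `group_regions`, the fixed `threshold`, and the 4-connected grouping rule are all hypothetical choices made for this example.

```python
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_patches(query_vec, patch_grid):
    """Score every patch embedding in a 2D grid against the query embedding."""
    return [[cosine(query_vec, patch) for patch in row] for row in patch_grid]


def group_regions(scores, threshold=0.5):
    """Merge 4-connected patches scoring >= threshold into bounding boxes.

    Hypothetical grouping rule for illustration; returns regions sorted by
    mean patch score, each with box = (top, left, bottom, right) in patch units.
    """
    rows, cols = len(scores), len(scores[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or scores[r][c] < threshold:
                continue
            # Breadth-first search over neighbouring patches above the threshold.
            queue = deque([(r, c)])
            seen[r][c] = True
            cells = []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and scores[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [y for y, _ in cells]
            xs = [x for _, x in cells]
            regions.append({
                "box": (min(ys), min(xs), max(ys), max(xs)),
                "num_patches": len(cells),
                "score": sum(scores[y][x] for y, x in cells) / len(cells),
            })
    regions.sort(key=lambda region: region["score"], reverse=True)
    return regions


# Toy 3x3 page: the top-left 2x2 block of patches matches the query direction.
query = [1.0, 0.0]
grid = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
]
regions = group_regions(score_patches(query, grid))
kept_fraction = regions[0]["num_patches"] / (len(grid) * len(grid[0]))
```

In this toy example only the four matching patches out of nine would be passed to the generator, which illustrates how retrieving regions instead of whole pages can cut the number of visual tokens. A real system would use learned patch embeddings and a tuned or adaptive grouping rule rather than a fixed threshold.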
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
