Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking
Chan-Wei Hu, Zhengzhong Tu

TL;DR
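Region-R1 introduces dynamic query-side region cropping for multi-modal re-ranking: a learned policy either keeps the full query image or crops it to a question-relevant region before the retrieved candidates are scored. This reduces the influence of visual distractors and yields state-of-the-art results on E-VQA and InfoSeek.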
Contribution
Proposes region-aware group relative policy optimization (r-GRPO), a policy optimization method for learning query-side region cropping in multi-modal re-ranking systems.
Findings
Achieves up to a 20% increase in conditional Recall@1 on the evaluated benchmarks.
Delivers consistent performance improvements across two challenging benchmarks, E-VQA and InfoSeek.
Demonstrates the effectiveness of query-side adaptation in MM-RAG re-ranking.
Abstract
Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region cropping framework that formulates region selection as a decision-making problem during re-ranking, allowing the system to learn to retain the full image or focus only on a question-relevant region before scoring the retrieved candidates. Region-R1 learns a policy with a novel region-aware group relative policy optimization (r-GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performances by…
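The crop-or-keep decision described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the box format, the grid-of-labels "image", the scoring function, and all function names are hypothetical. The advantage computation is the standard GRPO-style normalization of rewards within a sampled group, since the details of r-GRPO are not given in this summary.

```python
from statistics import mean, pstdev

# Hypothetical box format: (left, top, right, bottom), end-exclusive.
Box = tuple[int, int, int, int]

def apply_region_action(image, box):
    """Return the full image when box is None, else the cropped region.
    `image` is a list of rows, a toy stand-in for pixel data."""
    if box is None:
        return image
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

def toy_score(query_view, candidate):
    """Toy relevance score (an assumption): the fraction of cells in the
    query view whose label equals the candidate label."""
    cells = [cell for row in query_view for cell in row]
    return cells.count(candidate) / len(cells) if cells else 0.0

def rerank(query_view, candidates, score_fn=toy_score):
    """Sort candidates by descending score against the chosen query view."""
    return sorted(candidates, key=lambda c: score_fn(query_view, c), reverse=True)

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (r - mean) / (std + eps) within one
    group of sampled actions. Not the paper's r-GRPO, which is unspecified here."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A 3x3 "image" dominated by background, with one question-relevant cell.
image = [
    ["sky", "sky", "sky"],
    ["sky", "bird", "sky"],
    ["sky", "sky", "sky"],
]
candidates = ["sky", "bird"]

# One group of sampled actions: keep the full image, a tight crop, a loose crop.
actions = [None, (1, 1, 2, 2), (0, 0, 2, 2)]

# Reward 1.0 when the top-ranked candidate is the relevant one ("bird").
rewards = [
    1.0 if rerank(apply_region_action(image, a), candidates)[0] == "bird" else 0.0
    for a in actions
]
advantages = group_relative_advantages(rewards)
```

In this toy group only the tight crop ranks the relevant candidate first, so it is the only action with a positive advantage. A policy update would push probability toward that action and away from the full image and the loose crop.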
