MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

Fan Yang; Xingping Dong; Xin Yu; Wenhan Luo; Wei Liu; Kaihao Zhang

arXiv:2512.02906·cs.CV·March 20, 2026

Open Access

TL;DR

The paper introduces MRD, a novel, training-free framework that improves high-resolution image understanding in multimodal large language models by combining multi-resolution semantic fusion and open-vocabulary object detection.

Contribution

MRD is the first framework to integrate multi-resolution semantic fusion with open-vocabulary detection for HR image understanding without additional training.

Findings

01. Achieves state-of-the-art results on HR image benchmarks.

02. Effectively reduces object fragmentation and semantic bias.

03. Enhances both single-object and multi-object understanding.

Abstract

Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches leverage vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, improving the understanding capacity of MLLMs. However, this paradigm often leads to object fragmentation, which causes semantic bias and incomplete retrieval, and it also introduces false positives from irrelevant background patches. To address these issues, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework that enhances HR image understanding from both local and global perspectives. Locally, MRD enforces cross-scale semantic consistency via multi-resolution semantic fusion to mitigate single-resolution bias and alleviate object fragmentation. Globally, it integrates open-vocabulary object detection (OVD) as localization…
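To make the idea of cross-scale consistency concrete, here is a minimal sketch of one way patch-level relevance scores from two resolutions could be fused. It is not the authors' implementation: the function names (`upsample_nearest`, `fuse_multires`, `top_k_patches`), the weighted-average fusion rule, the `weight` parameter, and the toy score values are all assumptions made for illustration. The sketch assumes a query-to-patch similarity map has already been computed on a coarse grid and on a fine grid.

```python
import numpy as np


def upsample_nearest(scores: np.ndarray, factor: int) -> np.ndarray:
    """Repeat every cell `factor` times along both axes (nearest-neighbour)."""
    return np.kron(scores, np.ones((factor, factor)))


def fuse_multires(coarse: np.ndarray, fine: np.ndarray, weight: float = 0.5) -> np.ndarray:
    """Fuse two similarity maps on the fine grid.

    Hypothetical fusion rule: a weighted average of the upsampled coarse
    map and the fine map. The paper's actual rule may differ.
    """
    factor = fine.shape[0] // coarse.shape[0]
    if (coarse.shape[0] * factor, coarse.shape[1] * factor) != fine.shape:
        raise ValueError("fine grid must be an integer multiple of the coarse grid")
    return weight * upsample_nearest(coarse, factor) + (1.0 - weight) * fine


def top_k_patches(fused: np.ndarray, k: int) -> list[tuple[int, int]]:
    """Return (row, col) indices of the k highest-scoring fine patches."""
    order = np.argsort(fused, axis=None)[::-1][:k]
    return [tuple(int(i) for i in np.unravel_index(idx, fused.shape)) for idx in order]


# Toy data (made up): one object covers the top-left coarse cell, but on the
# fine grid it is fragmented, so only one of its four patches scores highly.
# Patch (3, 3) is a background false positive on the fine grid.
coarse = np.array([[0.9, 0.1],
                   [0.2, 0.1]])
fine = np.zeros((4, 4))
fine[0, 0], fine[0, 1], fine[1, 0], fine[1, 1] = 0.8, 0.3, 0.25, 0.2
fine[3, 3] = 0.7

fused = fuse_multires(coarse, fine)
selected = top_k_patches(fused, k=4)
```

In this toy case the fine map alone would rank the background patch `(3, 3)` second, while the fused map selects the four patches that together cover the object. This is the kind of fragmentation and false-positive behaviour the abstract describes, shown here only under the stated assumptions.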

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning