MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
Fan Yang, Xingping Dong, Xin Yu, Wenhan Luo, Wei Liu, Kaihao Zhang

TL;DR
The paper introduces MRD, a novel, training-free framework that improves high-resolution image understanding in multimodal large language models by combining multi-resolution semantic fusion and open-vocabulary object detection.
Contribution
MRD is the first framework to integrate multi-resolution semantic fusion with open-vocabulary detection for HR image understanding without additional training.
Findings
Achieves state-of-the-art results on HR image benchmarks.
Effectively reduces object fragmentation and semantic bias.
Enhances both single-object and multi-object understanding.
Abstract
Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches leverage vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, improving the understanding capacity of MLLMs. However, this paradigm often leads to object fragmentation, resulting in semantic bias and incomplete retrieval, while also introducing false positives from irrelevant background patches. To address these issues, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework that enhances HR image understanding from both local and global perspectives. Locally, MRD enforces cross-scale semantic consistency via multi-resolution semantic fusion to mitigate single-resolution bias and alleviate object fragmentation. Globally, it integrates open-vocabulary object detection (OVD) as localization…
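To make the idea of multi-resolution semantic fusion concrete, here is a minimal sketch. It is an illustration under assumptions, not the paper's method: this summary does not give MRD's actual fusion rule, so the sketch simply averages query-to-patch similarity maps computed at two grid resolutions. The function names (`upsample_nearest`, `fuse_similarity_maps`, `select_patches`), the similarity scores, and the threshold are all hypothetical.

```python
# Illustrative sketch only: averaging is an assumed fusion rule, and all
# scores below are made-up numbers, not results from the paper.

def upsample_nearest(grid, factor):
    """Repeat every cell `factor` times along both axes (nearest neighbour)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse_similarity_maps(maps):
    """Average square similarity grids after aligning them to the finest grid.

    Each grid size must divide the finest grid size evenly.
    """
    finest = max(len(m) for m in maps)
    aligned = []
    for m in maps:
        if finest % len(m) != 0:
            raise ValueError("grid sizes must divide the finest grid size")
        aligned.append(upsample_nearest(m, finest // len(m)))
    n = len(aligned)
    return [
        [sum(a[i][j] for a in aligned) / n for j in range(finest)]
        for i in range(finest)
    ]


def select_patches(scores, threshold):
    """Return (row, col) indices of patches scoring at or above `threshold`."""
    return [
        (i, j)
        for i, row in enumerate(scores)
        for j, v in enumerate(row)
        if v >= threshold
    ]


# Hypothetical query-to-patch similarities. At the coarse 2x2 scale the object
# sits in one patch; at the fine 4x4 scale it is split across four patches and
# only one fragment scores highly.
coarse = [
    [0.8, 0.2],
    [0.1, 0.1],
]
fine = [
    [0.9, 0.3, 0.1, 0.1],
    [0.4, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]

fused = fuse_similarity_maps([coarse, fine])
fine_only = select_patches(fine, 0.45)
fused_selection = select_patches(fused, 0.45)
```

With these made-up scores, thresholding the fine map alone keeps a single fragment of the object, while the fused map keeps all four fine patches that the coarse scale marks as relevant. That is the kind of fragmentation the abstract says cross-scale consistency is meant to alleviate.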
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
