Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Dacheng Tao

TL;DR
This paper introduces HR-Bench, a new benchmark for high-resolution image perception in multimodal large language models, and proposes a training-free framework, DC$^2$, that improves their understanding of 4K and 8K images by dividing, describing, and combining image patches.
Contribution
The paper presents HR-Bench for evaluating HR image perception and proposes DC$^2$, a novel training-free method to enhance MLLM understanding of high-resolution images.
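The divide, describe, and combine idea can be illustrated with a minimal sketch. This is not the authors' implementation: the quadrant-splitting rule, the `max_side` threshold of 1024 pixels, and the `describe` callback (a stand-in for an MLLM call that captions one patch) are all assumptions made for illustration, and the combine step here is plain concatenation of patch descriptions.

```python
from typing import Callable, List, Tuple

# A patch is a pixel rectangle: (left, top, right, bottom).
Box = Tuple[int, int, int, int]


def divide(box: Box, max_side: int) -> List[Box]:
    """Recursively split a box into quadrants until both sides are <= max_side.

    The quadrant rule is an illustrative assumption, not the paper's exact rule.
    """
    if max_side < 1:
        raise ValueError("max_side must be at least 1")
    left, top, right, bottom = box
    width, height = right - left, bottom - top
    if width <= 0 or height <= 0:
        return []  # skip degenerate boxes produced by splitting very thin regions
    if width <= max_side and height <= max_side:
        return [box]
    mid_x, mid_y = left + width // 2, top + height // 2
    quadrants = [
        (left, top, mid_x, mid_y),
        (mid_x, top, right, mid_y),
        (left, mid_y, mid_x, bottom),
        (mid_x, mid_y, right, bottom),
    ]
    patches: List[Box] = []
    for quadrant in quadrants:
        patches.extend(divide(quadrant, max_side))
    return patches


def describe_and_combine(
    image_size: Tuple[int, int],
    max_side: int,
    describe: Callable[[Box], str],
) -> str:
    """Describe every patch with `describe` and join the results into one text."""
    width, height = image_size
    patches = divide((0, 0, width, height), max_side)
    # Hypothetical combine step: one line of text per patch, in scan order.
    return "\n".join(f"Patch {box}: {describe(box)}" for box in patches)


if __name__ == "__main__":
    # Stub describer; a real pipeline would crop the patch and query an MLLM.
    summary = describe_and_combine((3840, 2160), 1024, lambda box: "<caption>")
    print(summary)
```

The combined text could then be passed to the model alongside a downsampled copy of the full image, so that the textual descriptions carry detail the downsampled pixels no longer contain.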
Findings
SOTA MLLMs achieve 63% accuracy on HR-Bench, below human performance of 87%.
DC$^2$ improves accuracy by +6% on HR-Bench and +8% on other benchmarks.
Leveraging text and image partitioning effectively compensates for visual information loss in HR images.
Abstract
Multimodal large language models (MLLMs) have experienced significant advancements recently, but still struggle to recognize and interpret intricate details in high-resolution (HR) images effectively. While state-of-the-art (SOTA) MLLMs claim to process images at 4K resolution, existing MLLM benchmarks only support up to 2K, leaving the capabilities of SOTA models on true HR images largely untested. Furthermore, existing methods for enhancing HR image perception in MLLMs rely on computationally expensive visual instruction tuning. To address these limitations, we introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K and 8K images. Through extensive experiments, we demonstrate that while downsampling HR images leads to visual information loss, leveraging complementary modalities, e.g., text, can effectively compensate for this loss.…
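To give a rough sense of the information loss from downsampling, the sketch below computes what fraction of the original pixels survives a resize. The concrete sizes are assumptions, not figures from the paper: 3840x2160 and 7680x4320 are used as stand-ins for "4K" and "8K", and 336x336 is a hypothetical vision-encoder input size.

```python
from typing import Tuple


def retained_pixel_fraction(
    source_size: Tuple[int, int], target_size: Tuple[int, int]
) -> float:
    """Fraction of source pixels that remain after resizing to target_size."""
    source_w, source_h = source_size
    target_w, target_h = target_size
    if min(source_w, source_h, target_w, target_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # Upsampling adds no information, so the fraction is capped at 1.0.
    return min(1.0, (target_w * target_h) / (source_w * source_h))


if __name__ == "__main__":
    encoder_input = (336, 336)  # assumed encoder resolution, for illustration only
    for name, size in [("4K", (3840, 2160)), ("8K", (7680, 4320))]:
        fraction = retained_pixel_fraction(size, encoder_input)
        print(f"{name}: {fraction:.2%} of pixels retained")
```

Under these assumed sizes, well under 2% of the pixels remain for either resolution, which is consistent with the abstract's point that fine details are lost when HR images are downsampled.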
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
