Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models

Wenbin Wang; Liang Ding; Minyan Zeng; Xiabin Zhou; Li Shen; Yong Luo; Dacheng Tao

arXiv:2408.15556·cs.CV·August 29, 2024

TL;DR

This paper introduces HR-Bench, a new benchmark for high-resolution image perception in multimodal large language models, and proposes a training-free framework, DC$^2$, that improves their understanding of 4K and 8K images by dividing, describing, and combining image patches.
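
The divide, describe, and combine idea in the summary above can be illustrated with a small sketch. This is not the authors' implementation: the summary does not specify the grid size, how each patch is described, or how descriptions are merged, so the fixed 2x2 grid, the function names `divide` and `describe_and_combine`, and the `placeholder_describe` stand-in for an MLLM call are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# A patch is a pixel box: (left, top, right, bottom).
Box = Tuple[int, int, int, int]


def divide(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height image into a rows x cols grid of patch boxes."""
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def describe_and_combine(boxes: List[Box], describe: Callable[[Box], str]) -> str:
    """Describe each patch, then join the descriptions into one text context."""
    lines = [f"Patch {i} at {box}: {describe(box)}" for i, box in enumerate(boxes)]
    return "\n".join(lines)


def placeholder_describe(box: Box) -> str:
    """Hypothetical stand-in for a model call that would caption one patch."""
    left, top, right, bottom = box
    return f"region of {right - left}x{bottom - top} pixels"


# Assumed example: a 4K (3840x2160) image split into a 2x2 grid.
boxes = divide(3840, 2160, rows=2, cols=2)
context = describe_and_combine(boxes, placeholder_describe)
print(context)
```

In a real pipeline, the combined text would be passed to the model alongside a downsampled copy of the full image, so that the text can carry detail that downsampling removes. That wiring is omitted here because it depends on the specific model API.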

Contribution

The paper presents HR-Bench for evaluating HR image perception and proposes DC$^2$, a novel training-free method to enhance MLLM understanding of high-resolution images.

Findings

1. SOTA MLLMs achieve 63% accuracy on HR-Bench, well below human performance of 87%.

2. DC$^2$ improves accuracy by 6% on HR-Bench and by 8% on other benchmarks.

3. Combining text descriptions with image partitioning effectively compensates for the visual information lost when HR images are downsampled.

Abstract

Multimodal large language models (MLLMs) have experienced significant advancements recently, but still struggle to recognize and interpret intricate details in high-resolution (HR) images effectively. While state-of-the-art (SOTA) MLLMs claim to process images at 4K resolution, existing MLLM benchmarks only support up to 2K, leaving the capabilities of SOTA models on true HR images largely untested. Furthermore, existing methods for enhancing HR image perception in MLLMs rely on computationally expensive visual instruction tuning. To address these limitations, we introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K&8K images. Through extensive experiments, we demonstrate that while downsampling HR images leads to vision information loss, leveraging complementary modalities, e.g., text, can effectively compensate for this loss.…

Code & Models

Repositories

DreamMr/HR-Bench
PyTorch · Official

Models

🤗
tuandunghcmut/vlmeval
model

Datasets

DreamMr/HR-Bench
dataset · 5.4k downloads

Videos

Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques