Visual Instruction Tuning with Chain of Region-of-Interest

Yixin Chen; Shuai Zhang; Boran Han; Bernie Wang

arXiv:2505.06840 · cs.CV · May 13, 2025

Open Access

TL;DR

This paper introduces CoRoI, a method for visual instruction tuning that focuses on key image regions to improve multimodal understanding efficiently, outperforming existing models on multiple benchmarks.

Contribution

We propose CoRoI, a novel approach that identifies and prioritizes important image regions to reduce computational load in high-resolution multimodal models.

Findings

01. CoRoI improves performance across 11 benchmarks.

02. Our 34B model surpasses proprietary methods on six benchmarks.

03. CoRoI outperforms GPT-4V on several multimodal tasks.

Abstract

High-resolution (HR) images are pivotal for enhancing the recognition and understanding capabilities of multimodal large language models (MLLMs). However, directly increasing image resolution can significantly escalate computational demands. In this study, we propose a method called Chain of Region-of-Interest (CoRoI) for Visual Instruction Tuning, aimed at alleviating the computational burden associated with high-resolution images for MLLMs. Drawing inspiration from the selective nature of the human visual system, we recognize that not all regions within high-resolution images carry equal importance. CoRoI seeks to identify and prioritize the most informative regions, thereby enhancing multimodal visual comprehension and recognition while circumventing the need for processing lengthy HR image tokens. Through extensive experiments on 11 benchmarks, we validate the efficacy of CoRoI…
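
To make the computational argument concrete, the sketch below counts visual tokens for a ViT-style encoder and compares encoding a full high-resolution image against encoding a low-resolution overview plus a few selected regions. It is an illustration only, not the CoRoI method: the 14-pixel patch size, the 336 and 1344 pixel image sizes, the 4x4 tiling, the region scores, and the helper names `num_patch_tokens` and `select_top_regions` are all assumptions chosen for the example.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image (assumed ViT-style patching)."""
    side = image_size // patch_size
    return side * side


def select_top_regions(scores, k):
    """Return the indices of the k highest-scoring regions, in index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Token cost of a low-resolution overview versus the full HR image.
low_res_tokens = num_patch_tokens(336)     # 24 * 24 patches
high_res_tokens = num_patch_tokens(1344)   # 96 * 96 patches

# Hypothetical importance scores for a 4x4 grid of 336-pixel tiles
# cut from the 1344-pixel image (made-up values, row-major order).
region_scores = [
    0.10, 0.05, 0.20, 0.10,
    0.90, 0.30, 0.10, 0.00,
    0.20, 0.80, 0.10, 0.10,
    0.00, 0.10, 0.70, 0.20,
]

# Keep only the three highest-scoring tiles.
kept = select_top_regions(region_scores, k=3)

# Overview tokens plus the tokens of each kept tile.
selected_tokens = low_res_tokens + len(kept) * num_patch_tokens(336)

print(f"full HR image: {high_res_tokens} tokens")
print(f"overview + {len(kept)} regions {kept}: {selected_tokens} tokens")
```

Under these assumptions, quadrupling the side length multiplies the token count by 16, while the overview plus three regions costs a quarter of the full high-resolution sequence. How CoRoI actually scores and chains regions is not specified in the text above, so the scoring step here is a placeholder.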

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning