Visual Instruction Tuning with Chain of Region-of-Interest
Yixin Chen, Shuai Zhang, Boran Han, Bernie Wang

TL;DR
This paper introduces CoRoI, a method for visual instruction tuning that focuses on key image regions to improve multimodal understanding efficiently, outperforming existing models on multiple benchmarks.
Contribution
We propose CoRoI, a novel approach that identifies and prioritizes important image regions to reduce computational load in high-resolution multimodal models.
Findings
CoRoI improves performance across 11 benchmarks.
Our 34B model surpasses proprietary methods on six benchmarks.
The 34B model also outperforms GPT-4V on several multimodal tasks.
Abstract
High-resolution (HR) images are pivotal for enhancing the recognition and understanding capabilities of multimodal large language models (MLLMs). However, directly increasing image resolution can significantly escalate computational demands. In this study, we propose a method called Chain of Region-of-Interest (CoRoI) for Visual Instruction Tuning, aimed at alleviating the computational burden associated with high-resolution images for MLLMs. Drawing inspiration from the selective nature of the human visual system, we recognize that not all regions within high-resolution images carry equal importance. CoRoI seeks to identify and prioritize the most informative regions, thereby enhancing multimodal visual comprehension and recognition while circumventing the need for processing lengthy HR image tokens. Through extensive experiments on 11 benchmarks, we validate the efficacy of CoRoI…
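To make the computational argument concrete, the sketch below counts ViT-style patch tokens at two resolutions and compares the cost of encoding a whole high-resolution image against a low-resolution overview plus a few selected regions. It is an illustration only and does not reproduce CoRoI's actual selection mechanism: the patch size of 14, the 336 and 672 pixel resolutions, the 2x2 crop grid, the helper names, and the importance scores are all assumptions chosen for the example.

```python
def num_patch_tokens(height, width, patch=14):
    """Number of ViT-style patch tokens for an image (patch size is an assumption)."""
    return (height // patch) * (width // patch)


def top_k_regions(scores, k):
    """Return the indices of the k highest-scoring regions, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Doubling the side length quadruples the number of visual tokens.
low_res = num_patch_tokens(336, 336)
high_res = num_patch_tokens(672, 672)

# Hypothetical importance scores for a 2x2 grid of 336x336 crops
# taken from the 672x672 image (made-up values, not from the paper).
scores = [0.10, 0.70, 0.05, 0.15]
selected = top_k_regions(scores, k=1)

# Token budget: one low-resolution overview plus the selected crops only.
roi_budget = low_res + len(selected) * num_patch_tokens(336, 336)

print(low_res, high_res, selected, roi_budget)
```

Under these assumptions, the overview plus one selected crop costs half the tokens of the full high-resolution image, which is the kind of saving that prioritizing informative regions is meant to deliver.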
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
