Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Zuyan Liu; Yuhao Dong; Yongming Rao; Jie Zhou; Jiwen Lu

arXiv:2403.12966 · cs.CV · March 22, 2024 · 1 citation

TL;DR

The paper introduces Chain-of-Spot, an interactive reasoning method that enhances large vision-language models by focusing on key image regions, leading to improved visual understanding and reasoning without increasing image resolution.

Contribution

The paper presents an interactive reasoning approach that improves feature extraction in Large Vision-Language Models (LVLMs) by focusing on key image regions, and it reports state-of-the-art results.

Findings

01. Significant performance improvements across multiple benchmarks.

02. Enhanced visual reasoning without increasing image resolution.

03. Effective integration with existing instruction-following models.

Abstract

In the realm of vision-language understanding, the proficiency of models in interpreting and reasoning over visual content has become a cornerstone for numerous applications. However, it is challenging for the visual encoder in Large Vision-Language Models (LVLMs) to extract useful features tailored to questions that aid the language model's response. Furthermore, a common practice among existing LVLMs is to utilize lower-resolution images, which restricts the ability for visual recognition. Our work introduces the Chain-of-Spot (CoS) method, which we describe as Interactive Reasoning, a novel approach that enhances feature extraction by focusing on key regions of interest (ROI) within the image, corresponding to the posed questions or instructions. This technique allows LVLMs to access more detailed visual information without altering the original image resolution, thereby offering…
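
The core idea in the abstract, focusing on a question-relevant region of interest while leaving the original image untouched, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `roi_to_pixels` and `crop_roi`, the normalized `(x0, y0, x1, y1)` box format, and the toy nested-list image are all assumptions made for illustration. The sketch only shows how a predicted ROI could be turned into a crop that is passed alongside the full image.

```python
import math
from typing import List, Tuple

# Toy grayscale image: a list of rows, each a list of pixel values (assumption).
Image = List[List[int]]
# Hypothetical ROI format: normalized (x0, y0, x1, y1), each in [0, 1].
Box = Tuple[float, float, float, float]


def roi_to_pixels(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Map a normalized box to integer pixel bounds, clamped to the image."""
    x0, y0, x1, y1 = box
    left = min(max(math.floor(x0 * width), 0), width - 1)
    top = min(max(math.floor(y0 * height), 0), height - 1)
    # Guarantee a crop of at least one pixel in each dimension.
    right = max(min(math.ceil(x1 * width), width), left + 1)
    bottom = max(min(math.ceil(y1 * height), height), top + 1)
    return left, top, right, bottom


def crop_roi(image: Image, box: Box) -> Image:
    """Return a copy of the ROI; the original image is not modified."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = roi_to_pixels(box, width, height)
    return [row[left:right] for row in image[top:bottom]]


if __name__ == "__main__":
    # 4 x 8 toy image whose pixel value encodes (row, column).
    image = [[r * 10 + c for c in range(8)] for r in range(4)]
    roi = crop_roi(image, (0.25, 0.5, 0.75, 1.0))
    print(roi)
```

In a real pipeline the crop would typically be resized to the visual encoder's input size, so the region is seen in more detail while the full image is still encoded at its original resolution; that resizing step is omitted here.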

Code & Models

Repositories

dongyh20/chain-of-spot
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies · Natural Language Processing Techniques