Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

TL;DR
This paper introduces Chain-of-Focus (CoF), a method that lets vision language models adaptively zoom into key image regions for efficient reasoning. The model is trained with supervised fine-tuning followed by reinforcement learning and outperforms existing VLMs by 5% on the V* benchmark.
Contribution
The paper presents a Chain-of-Focus approach with a two-stage training pipeline: supervised fine-tuning on the new MM-CoF dataset (3K samples) as a cold start, followed by reinforcement learning, to improve VLMs' reasoning and efficiency.
Findings
Outperforms existing VLMs by 5% on the V* benchmark.
Achieves consistent improvements across 8 image resolutions.
Demonstrates effective adaptive focusing for visual reasoning.
Abstract
Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the…
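The adaptive focusing idea in the abstract can be illustrated with a toy loop. This is only a minimal sketch and not the authors' implementation: the image is a plain 2D list of brightness values, and the names `chain_of_focus`, `crop`, and `brightest_quadrant` are hypothetical. The `brightest_quadrant` policy is an assumed stand-in for the VLM, which in CoF would decide where to zoom from visual cues and the question. The loop itself shows the general pattern of proposing a region, cropping to it, and stopping once no further zoom is requested.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive
Image = List[List[int]]          # toy image: rows of brightness values


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def chain_of_focus(
    image: Image,
    propose: Callable[[Image], Optional[Box]],
    max_steps: int = 4,
) -> List[Box]:
    """Zoom repeatedly until the policy returns None or the step budget runs out.

    `propose` is a hypothetical policy: it sees the current view and returns
    a box in that view's coordinates, or None when no more zooming is needed.
    Returns every focus box visited, in full-image coordinates.
    """
    focus: Box = (0, 0, len(image[0]), len(image))
    trace = [focus]
    for _ in range(max_steps):
        local = propose(crop(image, focus))
        if local is None:
            break
        left, top, right, bottom = local
        # Translate the view-relative box back into full-image coordinates.
        focus = (focus[0] + left, focus[1] + top, focus[0] + right, focus[1] + bottom)
        trace.append(focus)
    return trace


def brightest_quadrant(view: Image, min_size: int = 2) -> Optional[Box]:
    """Toy policy (an assumption, not a VLM): pick the quadrant with the largest sum."""
    h, w = len(view), len(view[0])
    if h <= min_size or w <= min_size:
        return None  # view is already small enough; stop zooming
    mh, mw = h // 2, w // 2
    quadrants = [(0, 0, mw, mh), (mw, 0, w, mh), (0, mh, mw, h), (mw, mh, w, h)]
    return max(quadrants, key=lambda q: sum(map(sum, crop(view, q))))


# Usage: an 8x8 blank image with one bright pixel at x=5, y=6.
image = [[0] * 8 for _ in range(8)]
image[6][5] = 9
trace = chain_of_focus(image, brightest_quadrant)
print(trace)  # [(0, 0, 8, 8), (4, 4, 8, 8), (4, 6, 6, 8)]
```

The number of zoom steps depends on the input, not on a fixed schedule: a view that is already small triggers no zoom at all, which is the sense in which the focusing is adaptive.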
Taxonomy
Topics: Speech and dialogue systems · Advanced Text Analysis Techniques · Semantic Web and Ontologies
Methods: Shrink and Fine-Tune
