Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Xintong Zhang; Zhi Gao; Bofei Zhang; Pengxiang Li; Xiaowen Zhang; Yang Liu; Tao Yuan; Yuwei Wu; Yunde Jia; Song-Chun Zhu; Qing Li

arXiv:2505.15436·cs.CV·December 8, 2025

TL;DR

This paper introduces Chain-of-Focus, a method that lets vision language models adaptively zoom into key image regions for efficient reasoning. The model is trained with supervised fine-tuning followed by reinforcement learning, and it outperforms existing VLMs by 5% on the V* benchmark.

Contribution

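The paper presents the Chain-of-Focus approach together with a two-stage training pipeline: supervised fine-tuning on a newly constructed dataset (MM-CoF), followed by reinforcement learning. The pipeline is intended to improve both the reasoning ability and the efficiency of VLMs.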

Findings

1. Outperforms existing VLMs by 5% on the V* benchmark.

2. Achieves consistent improvements across 8 image resolutions.

3. Demonstrates effective adaptive focusing for visual reasoning.

Abstract

Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the…
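To make the idea of adaptive focusing concrete, the following is a minimal sketch of a zoom loop, not the authors' implementation. It assumes a hypothetical `propose` callback that stands in for the VLM: given the current view, it returns either a smaller region to zoom into or `None` when the current view is sufficient to answer. The names `clamp_box`, `chain_of_focus`, and `toy_proposer`, and the `min_side` threshold, are illustrative assumptions and do not come from the paper.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), in pixels


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box so that it lies inside a width x height image."""
    left, top, right, bottom = box
    return (
        max(0, min(left, width)),
        max(0, min(top, height)),
        max(0, min(right, width)),
        max(0, min(bottom, height)),
    )


def chain_of_focus(
    image_size: Tuple[int, int],
    propose: Callable[[Box], Optional[Box]],
    max_steps: int = 4,
) -> List[Box]:
    """Return the sequence of views visited, starting from the full image.

    `propose` is a hypothetical stand-in for the model: it sees the current
    view and either returns a region to zoom into or None to stop.
    """
    width, height = image_size
    view: Box = (0, 0, width, height)
    chain = [view]
    for _ in range(max_steps):
        candidate = propose(view)
        if candidate is None:  # the current view is judged sufficient
            break
        left, top, right, bottom = clamp_box(candidate, width, height)
        if right <= left or bottom <= top:  # degenerate region, stop zooming
            break
        view = (left, top, right, bottom)
        chain.append(view)
    return chain


def toy_proposer(
    target: Tuple[int, int], min_side: int = 256
) -> Callable[[Box], Optional[Box]]:
    """Toy policy (an assumption for illustration): halve the view around a
    known target point until the longer side is at most `min_side` pixels."""
    tx, ty = target

    def propose(view: Box) -> Optional[Box]:
        left, top, right, bottom = view
        w, h = right - left, bottom - top
        if max(w, h) <= min_side:
            return None
        quarter_w, quarter_h = w // 4, h // 4
        return (tx - quarter_w, ty - quarter_h, tx + quarter_w, ty + quarter_h)

    return propose
```

With this toy policy, a high-resolution image triggers several zoom steps while a low-resolution image triggers none, which mirrors the adaptive behavior the abstract describes. In a real system the proposer would be the fine-tuned model itself, and each view would be cropped and re-encoded before the next step.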

Code & Models

Models

xintongzhang/CoF-sft-model-7b
model · 44 downloads

Datasets

xintongzhang/CoF-SFT-Data-5.4k
dataset · 135 downloads

Taxonomy

Topics: Speech and dialogue systems · Advanced Text Analysis Techniques · Semantic Web and Ontologies

Methods: Shrink and Fine-Tune