Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

Yifan Du; Kun Zhou; Yingqian Min; Yue Ling; Wayne Xin Zhao; Youbin Wu

arXiv:2511.22586 · cs.CV · December 1, 2025

TL;DR

This paper investigates how different Chain-of-Thought (CoT) formats influence the generalization of visual reasoning in vision-language models, revealing that concise CoT with minimal grounding outperforms longer, more detailed approaches.

Contribution

The paper systematically compares CoT designs on a controlled maze-solving benchmark, yielding new insights into which CoT strategies are effective for visual reasoning.

Findings

01. Concise CoT with essential grounding steps outperforms longer traces.

02. Visual and longer CoT accelerate convergence but do not improve final performance.

03. Minimal grounding results in better generalization across maze sizes.

Abstract

We study how different Chain-of-Thought (CoT) designs affect the acquisition of generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as "think with image", has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To evaluate this systematically, we focus on a controlled maze-solving benchmark where reasoning rules are fully visual, difficulty can be tuned by grid size, and all intermediate steps can be automatically generated. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate…
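To make the benchmark setup concrete, here is a minimal Python sketch of how a maze of tunable grid size and a coordinate-trajectory trace could be generated automatically. It is an illustration under assumptions, not the authors' data pipeline: the function names (generate_maze, solve, grounding_trace, move_answer), the randomized depth-first maze carving, the breadth-first solver, and the "(row,col) -> (row,col)" trace format are all hypothetical choices made for this example.

```python
import random
from collections import deque

# Hypothetical illustration only: none of these names or formats come from the paper.
STEPS = {(0, 1): "right", (1, 0): "down", (0, -1): "left", (-1, 0): "up"}


def generate_maze(n, seed=0):
    """Carve a perfect maze on an n x n grid with randomized depth-first search.

    Returns a dict mapping each cell (row, col) to the set of cells it connects to.
    """
    rng = random.Random(seed)
    passages = {(r, c): set() for r in range(n) for c in range(n)}
    visited = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        r, c = stack[-1]
        unvisited = [
            (r + dr, c + dc)
            for dr, dc in STEPS
            if (r + dr, c + dc) in passages and (r + dr, c + dc) not in visited
        ]
        if not unvisited:
            stack.pop()  # dead end: backtrack
            continue
        nxt = rng.choice(unvisited)
        passages[(r, c)].add(nxt)  # remove the wall between the two cells
        passages[nxt].add((r, c))
        visited.add(nxt)
        stack.append(nxt)
    return passages


def solve(passages, start, goal):
    """Return the shortest path from start to goal as a list of cells (BFS)."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for neighbor in passages[cell]:
            if neighbor not in parent:
                parent[neighbor] = cell
                queue.append(neighbor)
    path = []
    cell = goal
    while cell is not None:  # walk parents back to the start
        path.append(cell)
        cell = parent[cell]
    return path[::-1]


def grounding_trace(path):
    """Render the path as a coordinate trajectory (assumed trace format)."""
    return " -> ".join(f"({r},{c})" for r, c in path)


def move_answer(path):
    """Render the path as a list of moves, one per step."""
    return [STEPS[(b[0] - a[0], b[1] - a[1])] for a, b in zip(path, path[1:])]


if __name__ == "__main__":
    for n in (3, 5):  # grid size is the difficulty knob
        maze = generate_maze(n, seed=42)
        path = solve(maze, (0, 0), (n - 1, n - 1))
        print(n, grounding_trace(path), move_answer(path))
```

Because the solver produces every intermediate cell, the same path can be rendered as a long trace, a short coordinate trajectory, or only the final move list, which is the kind of controlled variation in CoT format the abstract describes.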

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Child and Animal Learning Development