From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou

TL;DR
This paper shows that vision-language models are limited on visual tasks mainly by perception rather than reasoning, and that training perception and reasoning in separate stages improves both while shortening reasoning traces.
Contribution
It introduces a staged post-training approach that optimizes visual perception, visual reasoning, and textual reasoning in separate stages, each with specialized data, and reports consistent gains over merged training across multiple VLMs.
Findings
Staged training improves both perception and reasoning compared with merged training.
Models trained in stages reach higher reasoning accuracy.
Perception-focused training reduces reasoning trace length.
Abstract
Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual…
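
The contrast between staged and merged training can be illustrated with a minimal data-scheduling sketch. This is not the authors' implementation: the function names, the placeholder string examples, and the `STAGE_ORDER` constant are hypothetical. The abstract states only that perception should be solidified before visual reasoning is refined; where textual reasoning sits in the order is an assumption made here for illustration.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple

# Hypothetical stage names. Perception before visual reasoning follows the
# abstract; the position of textual reasoning is an assumption of this sketch.
STAGE_ORDER = ("visual_perception", "visual_reasoning", "textual_reasoning")


def staged_schedule(
    pools: Dict[str, List[str]], order: Sequence[str] = STAGE_ORDER
) -> Iterator[Tuple[str, str]]:
    """Yield (stage, example) pairs one stage at a time, in the given order."""
    for stage in order:
        for example in pools[stage]:
            yield stage, example


def merged_schedule(
    pools: Dict[str, List[str]], seed: int = 0
) -> Iterator[Tuple[str, str]]:
    """Yield the same pairs, but shuffled together into a single mixed stream."""
    mixed = [(stage, ex) for stage, examples in pools.items() for ex in examples]
    random.Random(seed).shuffle(mixed)
    yield from mixed


if __name__ == "__main__":
    # Placeholder examples standing in for real training samples.
    pools = {
        "visual_perception": ["p1", "p2"],
        "visual_reasoning": ["v1"],
        "textual_reasoning": ["t1", "t2"],
    }
    for stage, example in staged_schedule(pools):
        print(stage, example)
```

Both schedules consume the same examples; they differ only in whether each capability gets its own contiguous phase. In an actual pipeline each phase would also carry its own objective (the abstract reports that RL works better than caption-based SFT for perception), which this sketch does not model.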
