Learning to Think Fast and Slow for Visual Language Models
Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou

TL;DR
This paper introduces DualMindVLM, a visual language model that mimics human dual-system thinking by adaptively choosing between fast and slow reasoning modes, improving efficiency and accuracy in visual reasoning tasks.
Contribution
The paper proposes a dual-mode training approach that leverages the natural variation in response length that pre-trained VLMs show across question types, giving the model explicit fast and slow thinking modes.
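As a rough illustration of the response-length idea, and not the paper's actual procedure, the sketch below labels question types as "fast" or "slow" by comparing the mean response length of a pre-trained model against a cutoff. The token counts, the `assign_modes` helper, and the threshold of 100 tokens are all hypothetical and chosen only to make the example concrete.

```python
from statistics import mean

# Hypothetical token counts of a pre-trained VLM's responses, grouped by
# question type. These numbers are invented for illustration only.
response_lengths = {
    "math": [210, 185, 240],
    "chart": [120, 95, 140],
    "perception": [18, 25, 12],
}


def assign_modes(lengths_by_type, threshold):
    """Label a question type 'slow' if its mean response length exceeds
    the threshold, otherwise 'fast'. The rule itself is an assumption."""
    return {
        qtype: "slow" if mean(lengths) > threshold else "fast"
        for qtype, lengths in lengths_by_type.items()
    }


# Assumed cutoff of 100 tokens; a real system would need to choose this
# from data rather than by hand.
modes = assign_modes(response_lengths, threshold=100)
print(modes)
```

With these made-up numbers, math and chart questions fall into the slow mode and perception questions into the fast mode, which mirrors the pattern the abstract describes for math versus perception problems.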
Findings
Outperforms baseline models on various benchmarks.
Achieves state-of-the-art reasoning accuracy.
Maintains high token efficiency.
Abstract
When faced with complex problems, we tend to engage in slower, more deliberate thinking. In contrast, for simple questions we give quick, intuitive responses. This dual-system thinking approach allows us to allocate cognitive resources efficiently, reserving deeper analytical effort for tasks that truly require it. However, existing reasoning-oriented visual language models (VLMs) are mostly trained to generate uniformly long reasoning, leading to substantial token waste when concise answers would suffice. In this paper, we observe that pre-trained, general-purpose VLMs vary their response length across question types, e.g., longer reasoning for math questions and shorter responses for perception problems. Unlike existing work, which overrides this prior by stimulating long reasoning regardless of problem complexity, we propose to leverage this prior to develop…
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Generative Adversarial Networks and Image Synthesis
