Learning to Think Fast and Slow for Visual Language Models

Chenyu Lin; Cheng Chi; Jinlin Wu; Sharon Li; Kaiyang Zhou

arXiv:2511.16670 · cs.CV · March 10, 2026


TL;DR

This paper introduces DualMindVLM, a visual language model that mimics human dual-system thinking by adaptively choosing between fast and slow reasoning modes, improving efficiency and accuracy in visual reasoning tasks.

Contribution

The paper proposes a dual-mode training approach that leverages the natural variation in response length of pre-trained VLMs, enabling explicit fast and slow thinking modes.

Findings

01. Outperforms baseline models on various benchmarks.
02. Achieves state-of-the-art reasoning accuracy.
03. Maintains high token efficiency.

Abstract

When faced with complex problems, we tend to engage in slower, more deliberate thinking. In contrast, for simple questions we give quick, intuitive responses. This dual-system thinking approach allows us to allocate cognitive resources efficiently, reserving deeper analytical effort for tasks that truly require it. However, existing reasoning-oriented visual language models (VLMs) are mostly trained to generate uniformly long reasoning, leading to substantial token waste when concise answers would suffice. In this paper, we observe that pre-trained, general-purpose VLMs manifest variations in response length for different question types, e.g., longer reasoning for math questions and shorter responses for perception problems. Unlike existing work, which overrides this prior by stimulating long reasoning without considering problem complexity, we propose to leverage this prior to develop…
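The abstract is truncated before it describes the method, so the following is only a hypothetical sketch of the stated observation, not the authors' procedure: assuming we have sampled several responses per question from a pre-trained VLM and counted their tokens, a question could be tagged "fast" or "slow" by comparing the median response length to a threshold. The function names, the median statistic, and the threshold are all illustrative assumptions.

```python
from statistics import median


def label_thinking_mode(response_lengths, threshold):
    """Hypothetical labeling rule: 'slow' if the pre-trained model's median
    response length (in tokens) exceeds `threshold`, otherwise 'fast'."""
    return "slow" if median(response_lengths) > threshold else "fast"


def split_by_mode(samples, threshold):
    """Split a {question_id: [token lengths of sampled responses]} mapping
    into (fast_ids, slow_ids), preserving the input order."""
    fast, slow = [], []
    for question_id, lengths in samples.items():
        if label_thinking_mode(lengths, threshold) == "slow":
            slow.append(question_id)
        else:
            fast.append(question_id)
    return fast, slow


if __name__ == "__main__":
    # Toy, made-up lengths for illustration only: a math-style question that
    # draws long responses and a perception-style question that draws short ones.
    toy_samples = {"q_math": [310, 280, 295], "q_count": [12, 15, 9]}
    print(split_by_mode(toy_samples, threshold=100))
```

The median is used here only because it is robust to an occasional unusually long sample; any other length statistic or threshold would be an equally valid assumption.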


Code & Models

Models

maifoundations/DualMindVLM (Hugging Face model)


Taxonomy

Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Generative Adversarial Networks and Image Synthesis