ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying
Weihang You, Qingchan Zhu, David Liu, Yi Pan, Geng Yuan, Hanqi Jiang

TL;DR
ViThinker is an active vision-language reasoning framework in which the model generates decision (query) tokens that trigger on-demand synthesis of visual features, improving reasoning accuracy by mimicking human active perception.
Contribution
It presents a novel active querying approach for vision-language models, enabling autonomous generation of visual features without external tools, inspired by human perception.
Findings
Outperforms passive methods on visual reasoning tasks
Enhances perceptual grounding and reasoning accuracy
Demonstrates consistent improvements across benchmarks
Abstract
Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models due to premature visual-to-text conversion that discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, i.e., processing pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens triggering the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training, performing generative mental simulation during inference without external tool calls. Through a two-stage curriculum: first distilling frozen experts into model parameters, then learning…
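The query-token mechanism described in the abstract can be pictured with a toy decoding loop. This is only an illustrative sketch, not the authors' implementation: the names `decode_with_queries`, `toy_next_token`, `toy_synthesize`, and the token string `<query:depth>` are all hypothetical, and the "model" is a fixed script standing in for a real vision-language model. It shows one idea only: when the model emits a query token, synthesized feature vectors are appended to the context before decoding continues, with no external tool call.

```python
from typing import Callable, List, Sequence, Union

# Hypothetical query token; the paper's actual query vocabulary is not given here.
QUERY_TOKEN = "<query:depth>"

Vector = List[float]
Step = Union[str, Vector]  # a context entry is either a text token or a feature vector


def decode_with_queries(
    next_token: Callable[[Sequence[Step]], str],
    synthesize: Callable[[str, Sequence[Step]], List[Vector]],
    prompt: Sequence[Step],
    max_steps: int = 16,
    eos: str = "<eos>",
) -> List[Step]:
    """Greedy loop: whenever a query token is emitted, append synthesized features."""
    context: List[Step] = list(prompt)
    for _ in range(max_steps):
        tok = next_token(context)
        context.append(tok)
        if tok == eos:
            break
        if tok.startswith("<query:"):
            # Assumption: features are produced internally, not by an external tool.
            context.extend(synthesize(tok, context))
    return context


# Toy stand-ins (assumptions for illustration only).
_SCRIPT = ["The", QUERY_TOKEN, "cup", "is", "closer", "<eos>"]


def toy_next_token(context: Sequence[Step]) -> str:
    # Count text tokens so far; the prompt contributes exactly one ("<image>").
    n_text = sum(isinstance(step, str) for step in context)
    return _SCRIPT[n_text - 1]


def toy_synthesize(query: str, context: Sequence[Step]) -> List[Vector]:
    # Two fixed placeholder "visual feature" vectors.
    return [[0.1, 0.2], [0.3, 0.4]]


trace = decode_with_queries(toy_next_token, toy_synthesize, ["<image>"])
```

In this sketch the resulting `trace` interleaves text tokens with feature vectors, so later tokens are conditioned on the features that the query requested.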
Taxonomy
Topics: Multimodal Machine Learning Applications · Language and cultural evolution · Neurobiology of Language and Bilingualism
