Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?
Mingyuan Wu, Meitang Li, Jingcheng Yang, Jize Jiang, Kaizhuo Yan, Zhaoheng Li, Hanchao Yu, Minjia Zhang, Klara Nahrstedt

TL;DR
This paper investigates whether inference-time scaling techniques such as self verification improve the mathematical reasoning of vision language models (VLMs). It finds that simple generation strategies such as majority voting outperform verification-centric strategies, and that behaviors associated with RL-tuned models do not enhance reasoning in VLMs.
Contribution
The study provides a comprehensive evaluation of inference-time scaling in VLMs, revealing the limitations of self verification and the impact of visual information on reasoning performance.
Findings
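Generation-time capability matters more than verification and refinement: simple majority voting outperforms verification-centric strategies such as best-of-N with self verification.
Behaviors associated with RL-tuned models, such as 'Aha moments', do not improve reasoning.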
Visual information is not effectively used in self verification.
Abstract
Inference time techniques such as decoding time scaling and self refinement have been shown to substantially improve mathematical reasoning in large language models (LLMs), largely attributed to emergent self correction and self verification behaviors often elicited through reinforcement learning (RL). In this work, we ask whether the same recipe transfers to vision language models (VLMs), especially RL finetuned variants that claim strong visual mathematical reasoning. Through extensive evaluation, we reach three main findings that differ markedly from text only models. First, generation time capability matters more than verification and refinement: simple majority voting consistently and substantially outperforms verification centric strategies such as best of N with self verification. Second, behaviors often associated with RL tuned models at inference time, such as the 'Aha…
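The two selection strategies contrasted in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `verifier` callable, and the toy scores are all hypothetical, and the sketch assumes each sampled response has already been reduced to a final answer string.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among N sampled responses."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the answer that the verifier scores highest.

    `verifier` is a hypothetical stand-in for the model judging its own
    output (self verification); here it is just any answer -> score function.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return max(answers, key=verifier)


# Toy data, invented for illustration only (not results from the paper).
samples = ["12", "15", "12", "12", "15"]
toy_scores = {"12": 0.4, "15": 0.9}

voted = majority_vote(samples)
verified = best_of_n(samples, lambda a: toy_scores[a])
print(voted, verified)
```

In this toy case the two strategies disagree: voting follows the answer sampled most often, while best-of-N follows whatever the verifier prefers, so its quality depends entirely on how reliable the verifier is.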
