Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions
David Acuna, Ximing Lu, Jaehun Jung, Hyunwoo Kim, Amlan Kar, Sanja Fidler, Yejin Choi

TL;DR
This paper introduces Socratic-MCTS, a search-based method that prompts vision-language models with subquestions to enhance their reasoning capabilities without additional training, leading to improved performance on reasoning benchmarks.
Contribution
It proposes a novel MCTS-inspired algorithm that elicits reasoning in pre-trained models by injecting subquestions, enabling extended reasoning without retraining.
Findings
Achieves a 2% overall improvement on the MMMU-PRO benchmark.
Yields a 9% gain in the Liberal Arts category.
Demonstrates consistent reasoning improvements across three benchmarks.
Abstract
Recent research in vision-language models (VLMs) has centered around the possibility of equipping them with implicit long-form chain-of-thought reasoning -- akin to the success observed in language models -- via distillation and reinforcement learning. But what about the non-reasoning models already trained and deployed across the internet? Should we simply abandon them, or is there hope for a search mechanism that can elicit hidden knowledge and induce long reasoning traces -- without any additional training or supervision? In this paper, we explore this possibility using a Monte Carlo Tree Search (MCTS)-inspired algorithm, which injects subquestion-subanswer pairs into the model's output stream. We show that framing reasoning as a search process -- where subquestions act as latent decisions within a broader inference trajectory -- helps the model "connect the dots" between fragmented…
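The search loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic MCTS skeleton (selection, expansion, simulation, backpropagation) over traces of subquestion-subanswer pairs. The names Node, search, propose, answer, and reward are hypothetical, and the three toy_* functions are stand-ins for calls to a vision-language model, which the abstract does not specify. The reward function in particular is an assumption made only so the example runs.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A trace is the sequence of (subquestion, subanswer) pairs injected so far.
Trace = Tuple[Tuple[str, str], ...]


@dataclass
class Node:
    trace: Trace
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

    def uct(self, c: float) -> float:
        """Upper-confidence score; unvisited nodes are tried first."""
        if self.visits == 0:
            return float("inf")
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return self.value / self.visits + explore


def extend(question, trace, subq, answer) -> Trace:
    """Append one subquestion and the model's subanswer to a trace."""
    return trace + ((subq, answer(question, trace, subq)),)


def search(question, propose, answer, reward,
           iterations=200, max_depth=2, c=1.4, seed=0) -> Trace:
    rng = random.Random(seed)
    root = Node(trace=())
    for _ in range(iterations):
        # 1. Selection: descend by UCT until reaching a node with no children.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.uct(c))
        # 2. Expansion: add one child per proposed subquestion, pick one.
        if len(node.trace) < max_depth:
            for subq in propose(question, node.trace):
                child = Node(extend(question, node.trace, subq, answer), parent=node)
                node.children.append(child)
            if node.children:
                node = rng.choice(node.children)
        # 3. Simulation: finish the trace with randomly chosen subquestions.
        trace = node.trace
        while len(trace) < max_depth:
            options = propose(question, trace)
            if not options:
                break
            trace = extend(question, trace, rng.choice(options), answer)
        r = reward(question, trace)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Read out the most-visited path as the final reasoning trace.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
    return node.trace


# Toy stand-ins for the model (assumptions, for illustration only).
POOL = ["What objects are shown?", "What do the axis labels say?",
        "What color is the border?"]
USEFUL = set(POOL[:2])  # pretend only these two help answer the question


def toy_propose(question, trace):
    asked = {q for q, _ in trace}
    return [q for q in POOL if q not in asked]


def toy_answer(question, trace, subq):
    return f"(model's answer to: {subq})"


def toy_reward(question, trace):
    return len(USEFUL & {q for q, _ in trace}) / len(USEFUL)


best = search("What trend does the chart show?",
              toy_propose, toy_answer, toy_reward)
```

In this toy setting the returned trace contains the two subquestions marked useful, because the search concentrates visits on branches with higher reward. With a real model, propose and answer would be prompts to the VLM, and the choice of reward signal is the main design decision left open here.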
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
