Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs
Yahan Yu, Yuyang Dong, Masafumi Oyamada

TL;DR
This paper introduces D2I, a framework that improves multimodal LLM reasoning by training with format rewards only, so no extra annotations are needed, and then shifting to intuitive reasoning at evaluation time.
Contribution
The D2I method improves multimodal LLM reasoning by decoupling training and testing reasoning styles using format rewards, without requiring additional data annotations.
Findings
D2I outperforms baselines on various benchmarks.
Format reward fosters transferable reasoning skills.
Decoupling training and test reasoning enhances flexibility.
Abstract
Reasoning is a key capability for large language models (LLMs), particularly when applied to complex tasks such as mathematical problem solving. However, multimodal reasoning research still requires further exploration of modality alignment and training costs. Many of these approaches rely on additional data annotation and relevant rule-based rewards to enhance the understanding and reasoning ability, which significantly increases training costs and limits scalability. To address these challenges, we propose the Deliberate-to-Intuitive reasoning framework (D2I) that improves the understanding and reasoning ability of multimodal LLMs (MLLMs) without extra annotations and complex rewards. Specifically, our method sets deliberate reasoning strategies to enhance modality alignment only through the rule-based format reward during training. While evaluating, the reasoning style shifts to…
Peer Reviews
Decision·Submitted to ICLR 2026
1. This paper is straightforward and easy to understand. 2. The three methods sound reasonable for the training stage. 3. The results achieve improvements on both in-domain and out-of-domain benchmarks.
1. Adding citations at Lines 094-096 for the claim that deliberate reasoning behaviors are commonly adopted would help readers better understand this field. 2. The citation at Line 499 is incomplete. 3. Line 394: MMe -> MME. 4. The paper lacks citations and discussion of previous reasoning MLLMs and related work, e.g., [1-5]: [1] Vision-r1: Incentivizing reasoning capability in multimodal large language models [2] R1-vl: Learning to reason with multimodal large language models via step-wise group relati
1. This paper proposes three different reasoning strategies, which help the model learn the alignment between the image and text modalities. The training stage also relies on a format reward, so no additional human annotation or expensive data generation is required, making the method cost-effective and scalable. 2. Allowing flexible generation at the inference stage is innovative, and the experimental results on in-domain datasets further support the effectiveness of the D2I framework.
1. The citation format appears to be wrong: no brackets wrap the citations, so they blend into the surrounding text. The figures are also poorly organized; for instance, Fig. 6 appears before Fig. 5. 2. The results on out-of-domain datasets show that D2D achieves better performance than the D2I models, but the authors do not discuss this phenomenon in detail.
1. The core idea directly addresses data and engineering bottlenecks in multimodal reasoning. Because the rewards are deterministic and easy to verify (presence/structure of tags, answer format), the approach is cheap to implement, simple to maintain, and broadly reproducible. This lowers the barrier to deploying reasoning-oriented training at scale and makes the method attractive beyond research prototypes. 2. D2I’s separation between deliberate training and intuitive inference is conceptually
1. The format rewards primarily check structure, including shape of <box>, <crucial>, <parse>, rather than semantic fidelity. A box that doesn’t truly cover the evidence, a plausible-sounding justification that is not grounded in the image, or a parse with internally inconsistent relations can still receive a reward. This risks teaching “good-looking” reasoning artifacts that do not guarantee genuine visual grounding. 2. Most compelling results hinge on a single backbone/configuration and visua
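The structure-only checking that this last review describes can be illustrated with a minimal sketch. It is not the authors' implementation: the tag names <box>, <crucial>, and <parse> come from the review text, while the function names, the four-number box syntax, and the 0/1 reward values are assumptions made for illustration.

```python
import re

# Tag names are taken from the review text; everything else is hypothetical.
TAGS = ("box", "crucial", "parse")

# Assumed box syntax: four comma-separated numbers, optionally in brackets.
BOX_PATTERN = re.compile(
    r"^\s*\[?\s*(-?\d+(?:\.\d+)?\s*,\s*){3}-?\d+(?:\.\d+)?\s*\]?\s*$"
)


def has_tag(response: str, tag: str) -> bool:
    """True if the response holds exactly one non-empty <tag>...</tag> pair."""
    matches = re.findall(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    return len(matches) == 1 and matches[0].strip() != ""


def box_is_well_shaped(response: str) -> bool:
    """True if the <box> content looks like four coordinates (shape only)."""
    match = re.search(r"<box>(.*?)</box>", response, flags=re.DOTALL)
    return match is not None and BOX_PATTERN.match(match.group(1)) is not None


def format_reward(response: str, strategy_tag: str) -> float:
    """Return 1.0 if the required tag is structurally valid, else 0.0.

    Only structure is inspected; the content is never compared with the image.
    """
    if strategy_tag not in TAGS:
        raise ValueError(f"unknown strategy tag: {strategy_tag}")
    if not has_tag(response, strategy_tag):
        return 0.0
    if strategy_tag == "box" and not box_is_well_shaped(response):
        return 0.0
    return 1.0


if __name__ == "__main__":
    print(format_reward("<box>[10, 20, 30, 40]</box> The answer is 5.", "box"))
    print(format_reward("<crucial>An ungrounded justification.</crucial>", "crucial"))
```

Under these assumptions, any non-empty <crucial> span earns the full reward whether or not it is grounded in the image, which is the gap between structural validity and semantic fidelity that the reviewer raises.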
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Intelligent Tutoring Systems and Adaptive Learning
