TL;DR
RAVR is an answer-guided reasoning framework for large language models: it conditions on reference answers to derive higher-quality reasoning paths, improving both learning efficiency and reasoning quality.
Contribution
The paper proposes RAVR, a novel answer-guided variational reasoning method that transforms question-only reasoning into a learnable process by conditioning on answers.
Findings
RAVR improves reasoning accuracy over strong baselines.
It reduces hesitation and consolidates conclusions in reasoning.
It enhances problem-specific strategies in LLM reasoning.
Abstract
Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but critically depends on a key prerequisite: the LLM can already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM's current competence, such reasoning paths can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that "Why is this the answer?" is often an easier question than "What is the answer?", as it avoids the heavy cognitive load of open-ended exploration, opting instead for explanatory reconstruction: systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on the answer provably increases the expected…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
+ The paper introduces an answer-conditioned variational objective that meaningfully improves exploration in RL-based reasoning, offering a conceptual advance over prior reward-only or STaR-style methods.
+ The experimental results show consistent performance gains across multiple reasoning benchmarks, supported by various ablations and behavioral analyses that validate the source of improvement.
+ RAVR treats the reference-answer token probability as the reward signal for the generated reasoning trace. However, this self-supervised, likelihood-based reward is known to be noisy and susceptible to reward hacking, where models optimize for probability shortcuts, which could easily lead to training collapse.
+ The method depends on having gold reference answers during training, which may limit its applicability in weakly supervised or open-ended settings where answers are unavailable or a
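To make the reward described in this review concrete, here is a minimal sketch of a likelihood-based reward for a sampled reasoning trace. It is not the paper's implementation: the function names, the geometric-mean form of length normalization, and the group-mean baseline are all assumptions chosen for illustration, and the log-probabilities are made-up numbers rather than model outputs.

```python
import math
from typing import List


def answer_likelihood_reward(answer_token_logprobs: List[float]) -> float:
    """Length-normalized likelihood of the reference answer after one trace.

    Each entry is assumed to be log p(y_t | x, z, y_<t) for one
    reference-answer token, scored after the sampled reasoning trace z.
    """
    if not answer_token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    mean_logprob = sum(answer_token_logprobs) / len(answer_token_logprobs)
    # Geometric-mean token probability; lies in (0, 1] for valid log-probs.
    return math.exp(mean_logprob)


def baseline_subtracted(rewards: List[float]) -> List[float]:
    """Subtract the group mean (an assumed baseline) from each reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hypothetical per-token log-probs of the reference answer for three traces.
trace_logprobs = [
    [-0.1, -0.2],        # trace that makes the answer likely
    [-1.5, -2.0, -1.0],  # trace that makes the answer unlikely
    [-0.05],             # trace that makes the answer very likely
]

rewards = [answer_likelihood_reward(lp) for lp in trace_logprobs]
advantages = baseline_subtracted(rewards)
```

The sketch also shows why the reviewer's concern is plausible: anything in the trace that raises the answer tokens' probability raises the reward, whether or not the reasoning is sound.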
1. Framing answer-conditioned reasoning as an amortized posterior that teaches the question-only prior via an ELBO-style objective, plus practical touches (utility-baseline, KL reweighting, answer-prefix, posterior “think-aloud” instructions), is novel. Using answers to induce better reasoning is standard, but incorporating it into the training objective is interesting.
2. The theoretical part is clearly derived: posterior re-weighting by s(z) implies a larger mass on good traces and a highe
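The re-weighting argument in this review can be checked numerically on a toy example. The sketch below assumes that s(z) denotes the probability of the reference answer given trace z (the review does not define it) and uses three hypothetical traces with made-up numbers. Re-weighting a prior p(z) by s(z) gives a posterior whose expected utility is E_p[s²]/E_p[s], which is never below E_p[s] because E_p[s²] ≥ (E_p[s])².

```python
from typing import List


def reweight(prior: List[float], s: List[float]) -> List[float]:
    """Posterior q(z) proportional to prior p(z) times utility s(z)."""
    weights = [p * u for p, u in zip(prior, s)]
    total = sum(weights)
    return [w / total for w in weights]


def expected(dist: List[float], values: List[float]) -> float:
    """Expectation of `values` under the discrete distribution `dist`."""
    return sum(p * v for p, v in zip(dist, values))


# Hypothetical question-only prior over three reasoning traces.
prior = [0.70, 0.25, 0.05]
# Assumed utilities s(z): chance each trace leads to the reference answer.
s = [0.10, 0.40, 0.90]

posterior = reweight(prior, s)
prior_utility = expected(prior, s)
posterior_utility = expected(posterior, s)

print(f"prior utility:     {prior_utility:.3f}")
print(f"posterior utility: {posterior_utility:.3f}")
```

In this toy setting the best trace is rare under the prior and gains mass after re-weighting, which is the "larger mass on good traces" effect the reviewer describes.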
1. I am still concerned that models can inflate likelihood via stylistic cues ("answer-prefix") instead of improving true correctness. The paper partly addresses this with a baseline and length normalization, but offers no further discussion.
2. It is unclear whether RAVR's advantage persists under the same wall-clock or token budgets as the RL baselines.
3. All results use Qwen3-1.7B. If during rebuttal the authors can show similar improvements on other, ideally larger, models, I will raise my score.
4. N
Solid theoretical foundation, as it provides rigorous mathematical proofs to validate that reference-answer conditioning amplifies high-utility reasoning paths. Its innovative integration of variational inference, using answer-conditioned reasoning as a surrogate for question-only reasoning, enables end-to-end training, avoiding the multi-stage complexity of existing methods like STaR.
- The reliance on a specific LLM (Qwen3-1.7B) and specific benchmarks raises concerns about generalizability.
- The paper focuses on tasks with clear reference answers (e.g., multiple-choice, math problems), leaving untested whether RAVR works for scenarios where answers are ambiguous or multi-faceted.
- I'm not quite sure how the method performs RL training, e.g., with GRPO. Is it that the input is x and y, and the output is z? What are the reward signals in GRPO? As far as I know, the thought path (z) is n
