Escaping the Verifier: Learning to Reason via Demonstrations
Locke Cai, Ivan Provilkov

TL;DR
This paper introduces RARO, a method that leverages expert demonstrations and adversarial training to enhance reasoning capabilities in large language models without relying on task-specific verifiers.
Contribution
RARO is an adversarial learning approach that trains reasoning models solely from expert demonstrations using inverse reinforcement learning, eliminating the need for task-specific verifiers.
Findings
RARO outperforms verifier-free baselines on reasoning tasks
The method scales robustly with larger models and data
It effectively learns reasoning skills from demonstrations alone
Abstract
Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization), which learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial game between a policy and a relativistic critic: the policy learns to mimic expert answers, while the critic aims to identify the experts among (expert, policy) answer pairs. Both the policy and the critic are trained jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong…
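The pairwise game described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name play_round, the Critic type, the 0/1 reward values, and the zero-sum reward split are all assumptions made for illustration, and the policy and critic here are plain Python callables rather than LLMs trained with RL. The sketch only shows how a critic that must pick the expert out of an (expert, policy) pair yields a reward signal for both players.

```python
import random
from typing import Callable, Tuple

# Hypothetical critic interface (an assumption, not from the paper):
# given a question and two answers, return 0 or 1, the index of the
# answer the critic believes was written by the expert.
Critic = Callable[[str, str, str], int]


def play_round(
    question: str,
    expert_answer: str,
    policy_answer: str,
    critic: Critic,
    rng: random.Random,
) -> Tuple[float, float]:
    """Play one round of the pairwise game.

    Returns (policy_reward, critic_reward). The 0/1 zero-sum rewards
    are an illustrative assumption.
    """
    # Shuffle the order so the critic cannot rely on answer position.
    expert_first = rng.random() < 0.5
    if expert_first:
        first, second, expert_index = expert_answer, policy_answer, 0
    else:
        first, second, expert_index = policy_answer, expert_answer, 1

    guess = critic(question, first, second)

    # The critic is rewarded for identifying the expert; the policy is
    # rewarded when its answer is mistaken for the expert's.
    critic_reward = 1.0 if guess == expert_index else 0.0
    policy_reward = 1.0 - critic_reward
    return policy_reward, critic_reward


if __name__ == "__main__":
    # Toy critic (hypothetical): assumes the longer answer is the expert's.
    def longer_is_expert(question: str, a: str, b: str) -> int:
        return 0 if len(a) >= len(b) else 1

    rng = random.Random(0)
    print(play_round("2+2?", "2 + 2 = 4, so the answer is 4.", "5",
                     longer_is_expert, rng))
```

In an actual training loop, these per-round rewards would feed RL updates for both models; the stabilization techniques the abstract mentions are not shown here.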
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
