Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning
Bodong Du, Xuanqi Huang, Xiaomeng Li

TL;DR
This paper introduces DARE, a distribution-aware reward estimation method for test-time reinforcement learning that improves reward accuracy and robustness by considering the full rollout distribution, leading to better language model self-improvement.
Contribution
DARE shifts reward estimation from a single majority-vote outcome to the full empirical rollout distribution, and adds exploration and pruning for more reliable self-improvement in LLMs.
Findings
DARE outperforms recent baselines on reasoning benchmarks.
Achieves 25.3% improvement on AIME 2024.
Achieves 5.3% improvement on AMC.
Abstract
Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV reduces the rollout distribution to a single outcome, discarding information about non-majority but correct action candidates, and yields systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a…
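The contrast the abstract draws between majority voting and a distribution-based reward can be illustrated with a small sketch. This is not DARE's actual formula, which the truncated abstract does not give: the function names are hypothetical, and the distribution-based reward is assumed here to be simply the empirical frequency of each rollout's final answer. The exploration bonus and pruning step are omitted.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Reward 1.0 for rollouts that match the most common answer, else 0.0."""
    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def empirical_distribution_rewards(answers):
    """Reward each rollout by the empirical frequency of its answer.

    Assumption: this frequency-based form is an illustration only,
    not the reward defined in the paper.
    """
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]


# Six hypothetical rollouts for one unlabeled question (final answers only).
rollouts = ["42", "42", "17", "42", "17", "9"]

mv = majority_vote_rewards(rollouts)
dist = empirical_distribution_rewards(rollouts)

for answer, r_mv, r_dist in zip(rollouts, mv, dist):
    print(f"answer={answer:>2}  MV reward={r_mv:.1f}  distribution reward={r_dist:.3f}")
```

Under majority voting, the two rollouts answering "17" receive the same zero reward as the single outlier "9". Under the frequency-based reward they remain distinguishable, which is the kind of information about non-majority candidates that the abstract says MV discards.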
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications
