Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning

Bodong Du; Xuanqi Huang; Xiaomeng Li

arXiv:2601.21804·cs.CL·January 30, 2026

TL;DR

This paper introduces DARE, a distribution-aware reward estimation method for test-time reinforcement learning that improves reward accuracy and robustness by considering the full rollout distribution, leading to better language model self-improvement.

Contribution

DARE shifts reward estimation from a single majority-vote outcome to the full empirical rollout distribution, and adds exploration and pruning for more reliable self-improvement in LLMs.

Findings

01

DARE outperforms recent baselines on reasoning benchmarks.

02

Achieves 25.3% improvement on AIME 2024.

03

Achieves 5.3% improvement on AMC.

Abstract

Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV reduces the rollout distribution to a single outcome, discarding information about non-majority but correct action candidates, and yields systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a…
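The contrast the abstract draws between majority voting and a distribution-based reward can be illustrated with a small sketch. The abstract does not give DARE's actual formula, so the frequency-based reward below is only an assumed stand-in for "using the full empirical rollout distribution"; the function names and the example answers are hypothetical, and the exploration bonus and the component cut off in the abstract are not modeled.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """MV-style reward: 1.0 if a rollout's answer equals the most common answer, else 0.0."""
    counts = Counter(answers)
    majority_answer, _ = counts.most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]


def empirical_distribution_rewards(answers):
    """Assumed illustration (not the paper's formula): reward each rollout
    with the empirical frequency of its final answer among all rollouts."""
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]


# Hypothetical final answers extracted from six rollouts on one unlabeled prompt.
answers = ["42", "42", "17", "42", "17", "8"]

mv_rewards = majority_vote_rewards(answers)
dist_rewards = empirical_distribution_rewards(answers)

for answer, mv, dist in zip(answers, mv_rewards, dist_rewards):
    print(f"answer={answer:>3}  mv={mv:.1f}  dist={dist:.3f}")
```

Under majority voting, the two rollouts answering "17" receive exactly the same reward as the single rollout answering "8", so the fact that "17" has twice the support is lost. The frequency-based variant keeps that difference, which is the kind of information the abstract says MV discards.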

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications