TL;DR
This paper identifies spurious signal amplification in test-time reinforcement learning (TTRL) for math reasoning and proposes DDRL, a framework that mitigates it through frequency-based sampling, debiased advantage estimation, and off-policy refinement.
Contribution
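The paper introduces DDRL, a framework that reduces spurious signals in TTRL by combining frequency-based sampling that excludes ambiguous samples, debiased advantage estimation with fixed advantages, and off-policy refinement for stable model updates, improving performance on math reasoning tasks.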
Findings
DDRL outperforms existing TTRL methods on multiple benchmarks.
Spurious signals are amplified by group-relative advantage estimation.
Frequency-based sampling excludes ambiguous, medium-consistency samples while keeping positive and negative examples balanced.
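The second finding can be illustrated with a small numeric sketch. It is not the paper's code: it assumes majority-vote pseudo-labels with a 0/1 reward and the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation). The function names and the toy answer lists are hypothetical. Under these assumptions, a group where only 2 of 8 responses agree gives its (likely unreliable) pseudo-label a larger positive advantage than a group where 6 of 8 agree.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers):
    """Pseudo-label = most frequent answer; reward 1.0 if a response matches it."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical toy groups of 8 sampled final answers for one question.
low_consistency = ["7", "3", "5", "9", "2", "7", "4", "8"]   # winner has 2/8 votes
high_consistency = ["7", "7", "7", "7", "7", "7", "3", "5"]  # winner has 6/8 votes

adv_low = group_relative_advantages(majority_vote_rewards(low_consistency))
adv_high = group_relative_advantages(majority_vote_rewards(high_consistency))

# Positive advantage given to responses that match the pseudo-label.
print(round(max(adv_low), 2))   # about 1.73 for the weakly supported label
print(round(max(adv_high), 2))  # about 0.58 for the strongly supported label
```

The weaker the agreement behind a pseudo-label, the harder the normalized update pushes toward it, which is one way label noise can be amplified rather than averaged out.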
Abstract
Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can even be amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization.…
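The first two steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the thresholds `low` and `high`, the balancing rule (truncate to the smaller class), the fixed advantage values of +1 and -1, and the function name `select_and_score` are all hypothetical choices made for the example.

```python
from collections import Counter


def select_and_score(answers, low=0.2, high=0.5):
    """Return {response_index: fixed_advantage} for the responses that are kept.

    Assumed rule (hypothetical thresholds):
      answer frequency >= high -> positive example, fixed advantage +1.0
      answer frequency <= low  -> negative example, fixed advantage -1.0
      anything in between      -> ambiguous, dropped
    """
    counts = Counter(answers)
    n = len(answers)
    positives, negatives = [], []
    for idx, ans in enumerate(answers):
        freq = counts[ans] / n
        if freq >= high:
            positives.append(idx)
        elif freq <= low:
            negatives.append(idx)
        # Medium-frequency answers fall through and are excluded.

    # Keep the two classes balanced by truncating to the smaller one.
    k = min(len(positives), len(negatives))
    kept = {i: 1.0 for i in positives[:k]}
    kept.update({i: -1.0 for i in negatives[:k]})
    return kept


# Hypothetical group of 10 sampled answers: "12" x5, "15" x3, and two singletons.
answers = ["12"] * 5 + ["15"] * 3 + ["9", "20"]
selected = select_and_score(answers)
print(selected)  # {0: 1.0, 1: 1.0, 8: -1.0, 9: -1.0}
```

Because the advantages are constants rather than group-normalized values, the size of the update no longer depends on how many responses happened to agree with the pseudo-label.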
