Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Yongcan Yu; Lingxiao He; Jian Liang; Kuangpu Guo; Meng Wang; Qianlong Xie; Xingxing Wang; Ran He

arXiv:2604.21327·cs.LG·April 24, 2026

TL;DR

This paper identifies spurious signal amplification as a failure mode of test-time reinforcement learning (TTRL) for math reasoning and proposes DDRL, a framework that mitigates it through frequency-based sampling, debiased advantage estimation, and off-policy refinement.

Contribution

The paper introduces DDRL (Debiased and Denoised test-time Reinforcement Learning), a framework that reduces spurious signals in TTRL by combining frequency-based sampling, debiased advantage estimation, and off-policy refinement, improving performance on math reasoning tasks.

Findings

01. DDRL outperforms existing TTRL methods on multiple benchmarks.

02. Spurious signals are amplified by group-relative advantage estimation.

03. Frequency-based sampling reduces ambiguity in training samples.
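
The second finding can be illustrated with a small calculation. The sketch below is not taken from the paper or its repository. It assumes majority-vote pseudo-labels with binary rewards and a GRPO-style normalization (reward minus group mean, divided by the group's population standard deviation); the function names are hypothetical. Under these assumptions, when a fraction p of responses matches the pseudo-label, each matching response receives an advantage of about sqrt((1 - p) / p), so a weakly supported pseudo-label, which is more likely to be wrong, gets a larger positive push than a strongly supported one.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers):
    """Return (pseudo_label, consistency, rewards) for one group of sampled answers.

    Assumption: the pseudo-label is the most frequent answer and the reward
    is 1.0 for a response that matches it, 0.0 otherwise.
    """
    label, count = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == label else 0.0 for a in answers]
    return label, count / len(answers), rewards


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: (r - group mean) / (group std + eps).

    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Low consistency: only 3 of 8 responses agree on the pseudo-label "7".
_, low_p, low_r = majority_vote_rewards(["7", "7", "7", "5", "9", "11", "2", "4"])
# High consistency: 7 of 8 responses agree.
_, high_p, high_r = majority_vote_rewards(["7"] * 7 + ["5"])

low_adv = group_relative_advantages(low_r)[0]    # roughly 1.29
high_adv = group_relative_advantages(high_r)[0]  # roughly 0.38
print(f"p={low_p:.3f} -> positive advantage {low_adv:.2f}")
print(f"p={high_p:.3f} -> positive advantage {high_adv:.2f}")
```

In this toy setting the less reliable pseudo-label produces the stronger positive update, which is one way to read the claim that group-relative estimation amplifies spurious signals.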

Abstract

Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can even be amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization.…
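
The first two steps described in the abstract, frequency-based sampling and fixed advantages, could look roughly like the following. This is an illustrative reading only: the abstract does not give thresholds, the exact selection rule, or the advantage values, so `lo`, `hi`, the +1/-1 advantages, and the function name `select_and_score` are all assumptions rather than the authors' implementation.

```python
import random
from collections import Counter


def select_and_score(answers, lo=0.2, hi=0.5, seed=0):
    """Map selected response indices to fixed advantages (+1.0 or -1.0).

    Hypothetical rule for one group of sampled answers:
      - positives: responses matching the majority answer, used only if that
        answer's frequency is at least `hi`;
      - negatives: responses whose answer frequency is at most `lo`;
      - everything in between is treated as ambiguous and dropped.
    Both sets are subsampled to the same size to keep them balanced.
    """
    n = len(answers)
    counts = Counter(answers)
    label, top = counts.most_common(1)[0]
    if top / n < hi:
        return {}  # the whole group is too ambiguous to train on

    pos = [i for i, a in enumerate(answers) if a == label]
    neg = [i for i, a in enumerate(answers)
           if a != label and counts[a] / n <= lo]

    # Balance positives and negatives by subsampling the larger set.
    k = min(len(pos), len(neg))
    rng = random.Random(seed)
    pos = rng.sample(pos, k)
    neg = rng.sample(neg, k)

    # Fixed advantages do not depend on the group's reward mean or spread.
    advantages = {i: 1.0 for i in pos}
    advantages.update({i: -1.0 for i in neg})
    return advantages


# "7" appears 5/10 times (positive), "5" 3/10 (ambiguous), "9" and "2" 1/10 (negative).
sampled = ["7"] * 5 + ["5"] * 3 + ["9", "2"]
print(select_and_score(sampled))
```

Because the advantages are constants, a pseudo-label with low agreement no longer receives a larger update than one with high agreement, unlike with group-normalized advantages.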

Code & Models

Repositories

yuyongcan/DDRL (GitHub)
