TL;DR
This paper introduces SCRL, a robust test-time reinforcement learning framework for large language models that uses selective positive and negative pseudo-labeling to improve reasoning accuracy and stability under challenging conditions.
Contribution
According to the authors, SCRL is the first test-time reinforcement learning framework for LLMs to incorporate negative pseudo-labeling. It pairs this with strict consensus filtering of positive pseudo-labels to improve robustness against label noise.
Findings
SCRL outperforms baseline methods on multiple reasoning benchmarks.
SCRL maintains training stability with limited rollout budgets.
SCRL effectively filters unreliable pseudo-labels, improving reasoning accuracy.
Abstract
Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first…
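To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares plain majority-vote pseudo-rewards with a selective variant that only assigns a positive label under strong consensus and only assigns negative labels when an entropy gate passes. The function names, the thresholds (`min_share`, `max_entropy`, `max_negative_share`), the +1/0/-1 reward encoding, and the direction of the entropy gate are all assumptions made for illustration; the truncated abstract does not specify them.

```python
import math
from collections import Counter


def majority_vote_rewards(answers):
    """Baseline pseudo-reward: 1.0 for rollouts matching the majority answer, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())


def selective_rewards(answers, min_share=0.6, max_entropy=1.0, max_negative_share=0.1):
    """Hypothetical selective labeling; every threshold here is an illustrative assumption.

    +1.0: answer equals the top answer AND its vote share >= min_share.
    -1.0: answer's vote share <= max_negative_share AND entropy <= max_entropy
          (assumed gate: only trust negatives when the distribution is concentrated).
     0.0: no supervision signal for this rollout.
    """
    n = len(answers)
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    has_consensus = top_count / n >= min_share
    gate_open = answer_entropy(answers) <= max_entropy

    rewards = []
    for a in answers:
        share = counts[a] / n
        if has_consensus and a == top_answer:
            rewards.append(1.0)
        elif gate_open and share <= max_negative_share:
            rewards.append(-1.0)
        else:
            rewards.append(0.0)
    return rewards


# Concentrated answers: strong consensus on "4", rare answers flagged negative.
concentrated = ["4", "4", "4", "4", "5", "4", "4", "4", "4", "7"]
# Dispersed answers: weak consensus, so the selective variant abstains entirely.
dispersed = ["1", "2", "3", "4", "1", "5", "6", "2"]

print(majority_vote_rewards(dispersed))
print(selective_rewards(concentrated))
print(selective_rewards(dispersed))
```

On the dispersed example, plain majority voting still rewards an answer that holds only a quarter of the votes, which is the failure mode the abstract describes. The selective variant returns no signal there rather than reinforcing a weak majority.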
