What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

Dong Yan; Jian Liang; Yanbo Wang; Shuo Lu; Ran He; Tieniu Tan

arXiv:2603.19880 · cs.LG · April 21, 2026

TL;DR

This paper introduces SCRL, a robust test-time reinforcement learning framework for large language models. SCRL uses selective positive and negative pseudo-labeling to improve reasoning accuracy and training stability when answer distributions are highly dispersed and majority-vote consensus is weak.

Contribution

The authors present SCRL as the first test-time reinforcement learning method for LLMs to incorporate negative pseudo-labeling and strict consensus filtering. Together, these two mechanisms improve robustness against label noise.
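To make the two mechanisms concrete, here is a minimal Python sketch of how selective positive and entropy-gated negative pseudo-labels might be assigned to a group of rollouts for one prompt. Everything specific in it is an assumption rather than something taken from the paper. The thresholds `tau_pos`, `h_max`, and `neg_max_share` are hypothetical, as are the {+1, 0, -1} reward values, the choice to gate on the entropy of the final-answer distribution, and the function names `answer_entropy` and `selective_complementary_rewards`. The Jasper-Yan/SCRL repository is the place to check the actual rule.

```python
import math
from collections import Counter


def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution of final answers."""
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())


def selective_complementary_rewards(
    answers: list[str],
    tau_pos: float = 0.6,          # hypothetical: min vote share for a positive label
    h_max: float = 1.5,            # hypothetical: entropy gate (nats) for negative labels
    neg_max_share: float = 0.125,  # hypothetical: max vote share for a negative label
) -> list[float]:
    """Per-rollout pseudo-rewards in {+1, 0, -1} for one prompt.

    Illustrative sketch only: the thresholds and the gating rule are
    assumptions, not the values or rule used in the SCRL paper.
    """
    n = len(answers)
    counts = Counter(answers)
    majority, top = counts.most_common(1)[0]
    rewards = [0.0] * n

    # Selective positive labeling: reward the majority answer only when the
    # consensus is strong (tau_pos > 0.5 also rules out tied majorities).
    if top / n >= tau_pos:
        rewards = [1.0 if a == majority else 0.0 for a in answers]

    # Entropy-gated negative labeling: when the answer distribution is not
    # too flat, penalize rare answers, which are likely to be wrong.
    if answer_entropy(answers) <= h_max:
        rewards = [
            -1.0 if counts[a] / n <= neg_max_share else r
            for a, r in zip(answers, rewards)
        ]
    return rewards


print(selective_complementary_rewards(["A"] * 6 + ["B", "C"]))
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0]
print(selective_complementary_rewards(["A"] * 3 + ["B"] * 3 + ["C", "D"]))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, -1.0]
print(selective_complementary_rewards(list("ABCDEFGH")))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The three calls cover the regimes this sketch targets. A strong majority yields both positive and negative labels. A weak plurality yields only negative labels on rare answers. A fully dispersed set yields no signal at all, rather than a noisy one.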

Findings

01. SCRL outperforms baseline methods on multiple reasoning benchmarks.
02. SCRL maintains training stability with limited rollout budgets.
03. SCRL effectively filters unreliable pseudo-labels, improving reasoning accuracy.

Abstract

Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to improve their reasoning on unlabeled test streams by deriving pseudo-rewards from majority-voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling. This reliance is fragile in challenging scenarios where answer distributions are highly dispersed: the resulting weak consensus inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that mitigates label-noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter out unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first…
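A few lines of Python can reproduce the baseline failure mode the abstract describes. This sketch assumes the common TTRL reward rule: a rollout earns reward 1 if its final answer matches the most frequent answer, and 0 otherwise. The function name `majority_vote_rewards` and the rollout answers are hypothetical. In the example, eight answers are widely dispersed. A wrong answer that wins only 3 of 8 votes still becomes the pseudo-label, so the three incorrect rollouts are rewarded and the two correct ones are not.

```python
from collections import Counter


def majority_vote_rewards(answers: list[str]) -> tuple[str, float, list[float]]:
    """Assumed TTRL-style baseline: the most frequent answer becomes the
    pseudo-label; each rollout gets 1.0 if it matches it, else 0.0.
    Returns (pseudo_label, vote_share, rewards)."""
    counts = Counter(answers)
    label, top = counts.most_common(1)[0]
    rewards = [1.0 if a == label else 0.0 for a in answers]
    return label, top / len(answers), rewards


# Hypothetical rollouts for a hard prompt: suppose the true answer is "42",
# but the wrong answer "17" wins a weak plurality.
rollouts = ["17", "17", "17", "42", "42", "9", "3", "8"]
label, share, rewards = majority_vote_rewards(rollouts)
print(label, share)  # 17 0.375
print(rewards)       # [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Nothing in the baseline checks how weak the consensus is. A 0.375 vote share is treated the same as a unanimous one, and this is the gap that strict consensus criteria are meant to close.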

Code & Models

Repositories

Jasper-Yan/SCRL (GitHub)
