When the Majority Votes Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window

Hongxiang Lin; Zhirui Kuai; Erpeng Xue; Lei Wang

arXiv:2605.19444·cs.LG·May 20, 2026

TL;DR

The paper argues that test-time reinforcement learning with majority-vote pseudo-labels can irreversibly corrupt problems a model initially solved once the vote locks onto a wrong answer, and proposes TTRL-Guard, a lightweight framework that mitigates this damage and reports improved benchmark performance.

Contribution

It introduces TTRL-Guard, a lightweight framework with three mechanisms that target the Correct-Answer Extinction Window, so that correct answers are not permanently suppressed during test-time reinforcement learning.

Findings

01 TTRL-Guard outperforms previous methods on multiple benchmarks.

02 The framework reduces the irreversible suppression of correct answers.

03 Experiments show a 54% relative improvement on AIME 2025.

Abstract

Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already-solvable problems rather than genuine learning, while problems corrupted from correct to incorrect outnumber truly learned ones, and this damage is irreversible once majority vote locks onto a wrong answer. Per-problem tracking reveals that correct-answer signals in low-ability problems are briefly active before being permanently suppressed, a phenomenon we term the Correct-Answer Extinction Window, with Flip Rate (FR) as its leading indicator. We thus propose TTRL-Guard, a lightweight framework with three mechanisms targeting the extinction window: Flip-Rate-Aware Reward Scaling (FRS) down-weights at-risk updates as FR…
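To make the mechanics in the abstract concrete, here is a minimal Python sketch of majority-vote pseudo-labeling together with one possible reading of Flip Rate. The abstract does not define FR or the FRS formula, so everything here is an assumption for illustration: the function names, the definition of FR as the fraction of consecutive training steps on which the majority answer changes, and the linear `1 - fr` scaling are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Pick the most frequent sampled answer as the pseudo-label and
    give reward 1.0 to samples that match it, 0.0 otherwise."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


def flip_rate(majority_history):
    """ASSUMED definition: fraction of consecutive steps on which the
    majority answer for one problem changed. The paper may define FR
    differently."""
    if len(majority_history) < 2:
        return 0.0
    pairs = zip(majority_history, majority_history[1:])
    flips = sum(prev != cur for prev, cur in pairs)
    return flips / (len(majority_history) - 1)


def scaled_rewards(rewards, fr):
    """HYPOTHETICAL scaling: shrink rewards linearly as FR grows.
    This stands in for FRS only as an illustration."""
    scale = 1.0 - fr
    return [r * scale for r in rewards]


# Five samples for one problem whose true answer is "42".
# The wrong answer "17" wins the vote, so it alone is rewarded.
label, rewards = majority_vote_rewards(["42", "17", "17", "42", "17"])

# Majority answer observed for this problem over five training steps.
fr = flip_rate(["42", "17", "17", "17", "42"])

damped = scaled_rewards(rewards, fr)
```

In this toy case the correct answer "42" appears in two of five samples yet receives zero reward, which is the failure mode the abstract describes: the update reinforces the wrong majority. An unstable majority gives a nonzero flip rate, and the illustrative scaling then shrinks the update for that problem instead of committing to the current vote.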

Code & Models

Repositories

linhxkkkk/TTRL-Guard (GitHub)
