TL;DR
This paper introduces SCOPE, a confidence-weighted pseudo-labeling framework for test-time reinforcement learning that improves reasoning and exploration, outperforming recent methods on multiple benchmarks.
Contribution
SCOPE integrates confidence estimation and dynamic subgroup partitioning to enhance pseudo-label quality and exploration in test-time reinforcement learning.
Findings
SCOPE achieves a 13.1% relative improvement on AIME 2025.
SCOPE outperforms recent baselines across various benchmarks.
The code is publicly available at https://github.com/szu-tera/SCOPE.
Abstract
Test-time reinforcement learning reduces the reliance on annotated data by using majority-voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for improving reasoning ability. However, this voting strategy often induces confirmation bias and suffers from sparse rewards, limiting overall performance. In this work, we propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE), a framework that integrates model confidence and dynamic subgroup partitioning to address these issues. Specifically, SCOPE incorporates the proposed step-wise confidence into pseudo-label estimation, prioritizing high-quality reasoning paths over simple frequency counts. Furthermore, it dynamically partitions the pool of candidate outputs into independent subgroups by balancing reasoning quality against exploration…
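To make the contrast between frequency-based and confidence-weighted pseudo-labels concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the toy data, and the choice to aggregate a path's step-wise confidences by their mean are all assumptions made for illustration, since the abstract does not specify how SCOPE computes or combines them.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Plain majority voting: the most frequent final answer is the pseudo-label."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers, step_confidences):
    """Weight each candidate's vote by a path-level confidence score.

    ASSUMPTION: the path-level score is the mean of the per-step
    confidences. The actual SCOPE aggregation may differ.
    """
    scores = defaultdict(float)
    for answer, steps in zip(answers, step_confidences):
        scores[answer] += sum(steps) / len(steps)
    return max(scores, key=scores.get)


# Hypothetical toy pool: five sampled reasoning paths for one question.
answers = ["42", "42", "42", "17", "17"]
step_confidences = [
    [0.3, 0.2],  # low-confidence path ending in "42"
    [0.2, 0.3],  # low-confidence path ending in "42"
    [0.3, 0.3],  # low-confidence path ending in "42"
    [0.9, 0.8],  # high-confidence path ending in "17"
    [0.9, 0.9],  # high-confidence path ending in "17"
]

print(majority_vote(answers))
print(confidence_weighted_vote(answers, step_confidences))
```

In this toy pool the frequent but low-confidence answer wins the plain vote, while the weighted vote selects the less frequent answer backed by higher-confidence paths. This is the kind of case where frequency counting alone could reinforce a confidently wrong majority.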
