TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning

Shenzhi Yang; Guangcheng Zhu; Xing Zheng; Yingfan MA; Zhongqi Chen; Bowen Song; Weiqiang Wang; Junbo Zhao; Gang Chen; Haobo Wang

arXiv:2512.13106·cs.LG·December 16, 2025

Open Access · 3 Reviews

TL;DR

TraPO is a semi-supervised reinforcement learning framework that improves reasoning-model performance by combining a small labeled set with unlabeled data, reaching 42.6% accuracy with only 1K labeled and 3K unlabeled samples.

Contribution

The paper proposes TraPO, a semi-supervised RLVR method that uses a small labeled set to stabilize training on unlabeled data, improving data efficiency and, in the reported experiments, outperforming fully supervised models.

Findings

01

TraPO achieves 42.6% accuracy with only 1K labeled and 3K unlabeled samples.

02

It surpasses unsupervised methods trained on much larger unlabeled datasets.

03

With 4K labeled and 12K unlabeled samples, TraPO outperforms fully supervised models.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, which, however, suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, such as through entropy and majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that utilizes a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on…
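The majority-voting reward mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration of the general idea of a consistency-based reward, not the paper's implementation: the function name `majority_vote_rewards`, the use of final-answer strings, and the binary 0/1 reward are all assumptions made for the example.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Score one prompt's rollouts against their own majority answer.

    `answers` holds the final answer extracted from each rollout
    (hypothetical input format). Returns the voted pseudo-label, a 0/1
    reward per rollout, and the fraction of rollouts that agree with the vote.
    """
    if not answers:
        raise ValueError("need at least one rollout answer")
    counts = Counter(answers)
    # most_common(1) returns the most frequent answer; ties resolve to the
    # answer that appeared first in the list.
    pseudo_label, votes = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    agreement = votes / len(answers)
    return pseudo_label, rewards, agreement


label, rewards, agreement = majority_vote_rewards(["12", "12", "7", "12"])
print(label, rewards, agreement)
```

Because the reward depends only on agreement among the model's own rollouts, a confidently wrong majority is rewarded just like a correct one, which is one way to picture the collapse risk the abstract describes.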

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

(1) My biggest takeaway is that a carefully curated mixture of labeled and unlabeled data can improve performance and resolve the instability issue that arises in unsupervised training. (2) The paper provides rigorous theoretical analysis, including generalization and convergence proofs linking trajectory consistency with domain adaptation theory. (3) The presentation is good and easy to follow, and the paper comes with comprehensive baselines and benchmarks.

Weaknesses

(1) My main concern is the practical usage of the method. To my understanding, the method works as a preprocessing data-selection step before launching training on the mixed data. What would be the cost and running time of this step? If it needs a long trajectory to identify good samples, is this process even more costly than the RL training itself? Correct me if I am wrong. (2) I am not sure whether the rollout pass rate is a 'stable' indicator for this problem, as every time the rollout could b

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Tackles an important and practical problem in semi-supervised RLVR.
- The proposed approach is overall reasonable and well-motivated.
- The paper is clearly written and well-organized.
- Experimental results show promising improvements.

Weaknesses

- Some design choices require stronger justification (see below).
- Missing key ablation studies to verify and better understand design components.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. **Clean, intuitive mechanism:** “learn from how learning evolves,” not from point estimates. The trajectory-matching mask is simple to implement atop GRPO and robust to domain shift.
2. **Label efficiency & breadth:** Strong results with tiny labeled sets; both ID and OOD improvements; ablations with OOD unlabeled data are convincing.
3. **Theoretical framing:** A readable (if high-level) bound linking trajectory similarity and confidence to generalization; clarifies the role of labeled anch

Weaknesses

1. **Novelty positioning:** Semi-supervised filtering by “learning dynamics” echoes curriculum/confidence-based selection; paper could contrast more sharply vs. entropy/self-certainty and preference-based filtering beyond empirical gains.
2. **Proxy fidelity:** Unlabeled “pseudo-pass rates” depend on majority voting; risk of reinforcing easy/short answers remains. Need stress tests where voting is systematically biased.
3. **Sensitivity & cost:** Top-p/Γ thresholds, warm-up length, rollouts (G

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)