SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
Yifu Huo, Chenglong Wang, Ziming Zhu, Shunjie Xing, Peinan Feng, Tongran Liu, Qiaozhi He, Tianhua Zhou, Xiaojia Chang, Jingbo Zhu, Zhengtao Yu, Tong Xiao

TL;DR
This paper introduces SPS, a training paradigm that interleaves reinforcement learning (RL) with inverse reinforcement learning (IRL) to improve exploration and multi-sample success rates (Pass@k) in reasoning-oriented large language models.
Contribution
It proposes Steering Probability Squeezing (SPS), a method that reshapes trajectory distributions to enhance exploration in RL for large language models.
Findings
SPS improves Pass@k across five reasoning benchmarks.
Analysis reveals an empirical upper bound on Pass@k in RL-based reasoning.
Alternating RL and IRL enhances exploration capacity in large language models.
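Pass@k, the metric these findings are stated in, is the probability that at least one of k sampled responses solves a problem. The summary does not say how the paper computes it, so the sketch below is an assumption: it uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct. The name pass_at_k and its arguments are illustrative and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled responses, c of which are correct.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper's exact protocol is not
    given in this summary.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts for one problem: 16 samples, 2 correct.
    for k in (1, 4, 16):
        print(k, pass_at_k(16, 2, k))
```

With k = 1 the estimator reduces to c / n, the single-sample success rate (Pass@1). Larger k rewards a policy that spreads its successes across more problems, which is why concentrating probability mass on a few trajectories can raise Pass@1 without raising Pass@k.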
Abstract
Reinforcement learning (RL) has emerged as a promising paradigm for training reasoning-oriented models by leveraging rule-based reward signals. However, RL training tends to improve single-sample success rates (i.e., Pass@1) while offering limited exploration of diverse reasoning trajectories, which is crucial for multi-sample performance (i.e., Pass@k). Our preliminary analysis reveals that this limitation stems from a fundamental squeezing effect, whereby probability mass is excessively concentrated on a narrow subset of high-reward trajectories, restricting genuine exploration and constraining attainable performance under RL training. To address this issue, we propose Steering Probability Squeezing (SPS), a training paradigm that interleaves conventional RL with inverse reinforcement learning (IRL). SPS treats on-policy rollouts as demonstrations and employs…
