A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization
Shiye Lei, Zhihao Cheng, Dacheng Tao

TL;DR
This paper introduces MinPRO, a new objective for reinforcement learning in large language models that stabilizes training by using a minimum prefix importance ratio, improving performance in off-policy settings.
Contribution
The paper identifies the instability caused by token-level importance sampling and proposes MinPRO, a simple surrogate that enhances stability and performance during off-policy RL training of LLMs.
Findings
MinPRO significantly improves training stability.
MinPRO achieves higher peak performance.
MinPRO is effective across various LLM architectures and benchmarks.
Abstract
Reinforcement learning (RL) post-training has increasingly demonstrated strong ability to elicit reasoning behaviors in large language models (LLMs). For training efficiency, rollouts are typically generated in an off-policy manner using an older sampling policy and then used to update the current target policy. To correct the resulting discrepancy between the sampling and target policies, most existing RL objectives rely on a token-level importance sampling ratio, primarily due to its computational simplicity and numerical stability. However, we observe that token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. In this paper, we revisit LLM policy optimization under off-policy conditions and show that the theoretically rigorous correction term is the prefix importance ratio, and that relaxing it to a token-level approximation can…
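The difference between the two correction terms named in the abstract can be illustrated with a small sketch. It assumes the prefix importance ratio at position t is the ratio of the two policies' probabilities for the whole prefix up to t, which for an autoregressive model is the running product of the token-level ratios. The function names and the log-probability values are hypothetical, and the sketch does not reproduce the paper's MinPRO objective, which the truncated abstract does not specify.

```python
import math


def token_ratios(logp_target, logp_sampling):
    """Per-token ratios: target probability of token t over sampling probability."""
    return [math.exp(lt - ls) for lt, ls in zip(logp_target, logp_sampling)]


def prefix_ratios(logp_target, logp_sampling):
    """Prefix ratios: product of token ratios over positions 1..t.

    Accumulated in log space so long sequences do not overflow or underflow
    before the final exponentiation.
    """
    ratios, running = [], 0.0
    for lt, ls in zip(logp_target, logp_sampling):
        running += lt - ls
        ratios.append(math.exp(running))
    return ratios


# Hypothetical per-token probabilities for one 4-token response (illustration only).
logp_sampling = [math.log(p) for p in (0.5, 0.4, 0.5, 0.2)]
logp_target = [math.log(p) for p in (0.25, 0.4, 0.5, 0.4)]

tok = token_ratios(logp_target, logp_sampling)
pre = prefix_ratios(logp_target, logp_sampling)

# The token-level ratio at each position ignores earlier mismatch,
# while the prefix ratio carries it forward through the product.
print("token-level:", [round(r, 3) for r in tok])
print("prefix:     ", [round(r, 3) for r in pre])
```

In this toy example the mismatch at the first token is invisible to the token-level ratios at positions two and three, but it still scales the prefix ratios there. This is one way to picture why the two corrections can diverge as the policies drift apart.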
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Machine Learning and Data Classification
