SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models
Lei Yang, Wei Bi, Chenxi Sun, Renren Jin, Deyi Xiong

TL;DR
SOUP introduces a token-level mix-policy reinforcement learning framework for large language models, combining off- and on-policy data at the token level to enhance exploration, stability, and performance.
Contribution
It proposes a novel token-level mix-policy paradigm that unifies off- and on-policy learning within individual samples, improving exploration and training stability in LLM RL.
Findings
Outperforms standard on-policy training and existing off-policy methods.
Enhances exploration and final performance of large language models.
Provides an analysis of how the fine-grained mix-policy design improves training.
Abstract
On-policy reinforcement learning (RL) methods widely used for language model post-training, like Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the Single-sample Mix-policy Unified Paradigm (SOUP), a framework that unifies off- and on-policy learning within individual samples at the token level. It confines off-policy influence to the prefix of a generated sequence sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP effectively leverages off-policy information while preserving training stability. Extensive experiments demonstrate that…
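The prefix/continuation split and the token-level importance ratios described in the abstract can be illustrated with a small sketch. This is not the paper's objective, which the abstract does not spell out. It assumes a PPO/GRPO-style clipped surrogate, a single scalar advantage per sample, and a ratio defined per token against whichever policy generated that token. All function names and the `eps` default are hypothetical.

```python
import math

def mixed_behavior_logps(hist_prefix_logps, rollout_continuation_logps):
    """Per-token log-probs under the policy that generated each token.

    Assumption: the prefix tokens were sampled from a historical policy and
    the continuation tokens from the current rollout policy.
    """
    return list(hist_prefix_logps) + list(rollout_continuation_logps)

def token_ratios(current_logps, behavior_logps):
    """Token-level importance ratios pi_theta(token) / pi_behavior(token)."""
    if len(current_logps) != len(behavior_logps):
        raise ValueError("log-prob sequences must have the same length")
    return [math.exp(c - b) for c, b in zip(current_logps, behavior_logps)]

def clipped_surrogate(current_logps, behavior_logps, advantage, eps=0.2):
    """Mean clipped surrogate over tokens (assumed PPO-style clipping)."""
    ratios = token_ratios(current_logps, behavior_logps)
    if not ratios:
        raise ValueError("sequence must contain at least one token")
    total = 0.0
    for r in ratios:
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        # Take the pessimistic value of the unclipped and clipped terms.
        total += min(r * advantage, clipped * advantage)
    return total / len(ratios)

# Illustrative values only: a 2-token historical prefix, 1 on-policy token.
behavior = mixed_behavior_logps([-1.0, -2.0], [-0.5])
current = [-0.9, -2.1, -0.5]
objective = clipped_surrogate(current, behavior, advantage=1.0)
```

In this sketch, continuation tokens have ratios near 1 because they come from the current policy, so any large correction is confined to the prefix, and clipping bounds its effect.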
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
