SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models

Lei Yang; Wei Bi; Chenxi Sun; Renren Jin; Deyi Xiong

arXiv:2601.21476·cs.CL·January 30, 2026

Open Access

TL;DR

SOUP introduces a token-level mix-policy reinforcement learning framework for large language models, combining off- and on-policy data at the token level to enhance exploration, stability, and performance.

Contribution

It proposes a novel token-level mix-policy paradigm that unifies off- and on-policy learning within individual samples, improving exploration and training stability in LLM RL.

Findings

01

Outperforms standard on-policy training and existing off-policy methods.

02

Enhances exploration and final performance of large language models.

03

Provides an analysis of how fine-grained mix-policy learning improves training.

Abstract

On-policy reinforcement learning (RL) methods widely used for language model post-training, like Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the Single-sample Mix-pOlicy Unified Paradigm (SOUP), a framework that unifies off- and on-policy learning within individual samples at the token level. It confines off-policy influence to the prefix of a generated sequence sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP effectively leverages off-policy information while preserving training stability. Extensive experiments demonstrate that…
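
To make the idea of token-level importance ratios within a single sample concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's objective: the abstract does not give the exact loss, so the PPO-style clipping, the single per-sequence advantage, the function names, and all numeric values below are hypothetical. The sample has a two-token prefix drawn from a historical policy and a three-token continuation drawn from the current policy, so only the prefix ratios can differ from 1.

```python
import math


def token_importance_ratios(logp_current, logp_behavior):
    """Per-token ratio pi_current(token) / pi_behavior(token), from log-probs."""
    if len(logp_current) != len(logp_behavior):
        raise ValueError("log-prob lists must have equal length")
    return [math.exp(c - b) for c, b in zip(logp_current, logp_behavior)]


def clipped_token_objective(ratios, advantage, eps=0.2):
    """Mean PPO-style clipped surrogate over tokens (an assumed objective)."""
    terms = []
    for r in ratios:
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * advantage, clipped * advantage))
    return sum(terms) / len(terms)


# Hypothetical sample: the first `prefix_len` tokens come from a historical
# policy, the remaining tokens were generated by the current policy.
prefix_len = 2
logp_current = [-1.0, -2.0, -0.5, -0.7, -1.2]
# Behavior log-probs: historical policy on the prefix, current policy after it.
logp_behavior = [-1.5, -1.0, -0.5, -0.7, -1.2]

ratios = token_importance_ratios(logp_current, logp_behavior)
objective = clipped_token_objective(ratios, advantage=1.0)
```

In this toy setup the continuation ratios are exactly 1 because the behavior and current policies coincide there, so any off-policy correction is confined to the prefix tokens, which mirrors the prefix/continuation split described in the abstract.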

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning