VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Guobin Shen; Chenxiao Zhao; Xiang Cheng; Lei Huang; Xing Yu

arXiv:2602.10693 · cs.LG · May 11, 2026

TL;DR

VESPO is a variational, sequence-level policy optimization method that stabilizes off-policy training of large language models by reducing the variance of importance-weighted updates, and it improves performance on tasks such as math reasoning and code generation.

Contribution

The paper presents a variational formulation for reshaping sequence-level importance weights, giving a principled, closed-form alternative to heuristic clipping and normalization for stabilizing off-policy LLM training.

Findings

01 VESPO maintains stable training under severe off-policy conditions.

02 VESPO outperforms recent reshaping baselines in experiments.

03 VESPO improves performance in math reasoning and code generation tasks.

Abstract

Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an unbiased correction but suffers from high variance, which is amplified by unbounded ratios and autoregressive generation. Prior remedies either rely on scenario-specific engineering, or trade bias for variance via token-level clipping or sequence-level normalization, yet these approaches remain largely heuristic. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By explicitly incorporating variance reduction into a variational formulation, we derive a principled closed-form reshaping kernel that operates directly on sequence-level importance weights, avoids token-level approximation and length normalization, and admits an…
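To make the variance problem described in the abstract concrete, the sketch below computes a naive sequence-level importance weight as the product of per-token probability ratios, alongside a length-normalized (geometric-mean) variant of the kind the abstract lists among prior remedies. It is an illustration only and is not the VESPO reshaping kernel, which this page does not spell out. The function names and the per-token log-ratio of 0.05 are assumptions chosen for the example.

```python
import math

def sequence_log_ratio(logp_new, logp_old):
    """Log of the sequence-level importance weight: sum of per-token log-ratios."""
    if len(logp_new) != len(logp_old):
        raise ValueError("token log-prob lists must have equal length")
    return sum(n - o for n, o in zip(logp_new, logp_old))

def sequence_weight(logp_new, logp_old):
    """Naive (unbiased) sequence-level weight: product of per-token ratios."""
    return math.exp(sequence_log_ratio(logp_new, logp_old))

def length_normalized_weight(logp_new, logp_old):
    """Geometric mean of per-token ratios: bounded growth, but biased."""
    return math.exp(sequence_log_ratio(logp_new, logp_old) / len(logp_new))

# Hypothetical data: every token is slightly more likely under the new policy
# (log-ratio of 0.05 per token). The naive weight grows exponentially with
# sequence length, while the normalized weight stays near exp(0.05).
for length in (10, 100, 1000):
    new = [-1.0] * length   # per-token log-probs under the current policy
    old = [-1.05] * length  # per-token log-probs under the rollout policy
    print(length, sequence_weight(new, old), length_normalized_weight(new, old))
```

Even a small, uniform per-token mismatch compounds multiplicatively over a long autoregressive sequence, which is the amplification the abstract refers to. Length normalization tames the growth at the cost of bias.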

Code & Models

Repositories

FloyedShen/VESPO (GitHub)
