TL;DR
VESPO introduces a variational sequence-level policy optimization method that stabilizes off-policy training of large language models, reducing variance and improving performance in tasks like math reasoning and code generation.
Contribution
The paper presents a novel variational formulation for sequence-level importance weight reshaping, providing a principled approach to stabilize off-policy LLM training.
Findings
VESPO maintains stable training under severe off-policy conditions.
VESPO outperforms recent reshaping baselines in experiments.
VESPO improves performance in math reasoning and code generation tasks.
Abstract
Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an unbiased correction but suffers from high variance, which is amplified by unbounded ratios and autoregressive generation. Prior remedies either rely on scenario-specific engineering, or trade bias for variance via token-level clipping or sequence-level normalization, yet these approaches remain largely heuristic. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By explicitly incorporating variance reduction into a variational formulation, we derive a principled closed-form reshaping kernel that operates directly on sequence-level importance weights, avoids token-level approximation and length normalization, and admits an…
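To make the variance problem in the abstract concrete, here is a minimal sketch of how a sequence-level importance weight is built from per-token log-probabilities, alongside simplified versions of the two prior remedies the abstract names (token-level clipping and sequence-level length normalization). It does not implement VESPO's reshaping kernel, whose closed form is not given in the text above. The function names and the `eps` clipping parameter are hypothetical, chosen only for illustration.

```python
import math


def sequence_log_ratio(target_logprobs, behavior_logprobs):
    """Log of the sequence-level importance weight.

    For an autoregressive model the sequence ratio is the product of
    per-token ratios, so its log is the sum of per-token log ratios.
    """
    if len(target_logprobs) != len(behavior_logprobs):
        raise ValueError("per-token log-prob lists must have equal length")
    return sum(t - b for t, b in zip(target_logprobs, behavior_logprobs))


def sequence_weight(target_logprobs, behavior_logprobs):
    """Naive (unbounded) sequence-level importance weight."""
    return math.exp(sequence_log_ratio(target_logprobs, behavior_logprobs))


def length_normalized_weight(target_logprobs, behavior_logprobs):
    """Illustrative length normalization: geometric mean of token ratios."""
    n = len(target_logprobs)
    if n == 0:
        raise ValueError("sequence must contain at least one token")
    return math.exp(sequence_log_ratio(target_logprobs, behavior_logprobs) / n)


def token_clipped_ratios(target_logprobs, behavior_logprobs, eps=0.2):
    """Illustrative token-level clipping of each ratio to [1 - eps, 1 + eps]."""
    return [
        min(max(math.exp(t - b), 1.0 - eps), 1.0 + eps)
        for t, b in zip(target_logprobs, behavior_logprobs)
    ]


if __name__ == "__main__":
    # Assumed toy data: every token is 10% more likely under the target policy.
    length = 50
    target = [math.log(0.55)] * length
    behavior = [math.log(0.50)] * length

    print("sequence weight:", sequence_weight(target, behavior))
    print("length-normalized:", length_normalized_weight(target, behavior))
    print("first clipped token ratio:", token_clipped_ratios(target, behavior)[0])
```

In this toy case a modest 1.1 per-token ratio compounds to 1.1^50 over 50 tokens, which is the amplification by autoregressive generation that the abstract describes. Length normalization collapses it back to 1.1, and clipping bounds each token separately, but both change the estimator and so trade bias for variance.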
