ESPO: Entropy Importance Sampling Policy Optimization

Yuepeng Sheng; Yuwei Huang; Shuman Liu; Anxiang Zeng; Haibo Zhang

arXiv:2512.00499·cs.LG·February 17, 2026

Open Access

TL;DR

ESPO introduces a novel reinforcement learning framework that combines entropy-based importance sampling and adaptive clipping to improve training stability and efficiency for large language models on complex reasoning tasks.

Contribution

The paper proposes a method that decomposes sequences by entropy to improve gradient utilization and stability in RL training of language models.

Findings

01. Accelerates convergence on mathematical reasoning benchmarks.

02. Achieves state-of-the-art performance on complex reasoning tasks.

03. Improves training stability and efficiency through entropy-based techniques.

Abstract

Reinforcement learning (RL) has become a central component of post-training for large language models (LLMs), particularly for complex reasoning tasks that require stable optimization over long generation horizons. However, achieving performance at scale often introduces a fundamental trade-off between training stability and training efficiency. Token-level optimization applies fine-grained updates to individual tokens but is prone to high variance in gradient estimation, which can result in unstable training dynamics. In contrast, sequence-level optimization often relies on aggressive clipping mechanisms to ensure stable updates. However, such a design may discard a large fraction of valid training samples, leading to inefficient gradient utilization and reduced training efficiency. We refer to this phenomenon as gradient underutilization. In this work, we propose Entropy Importance…
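
The trade-off the abstract describes can be illustrated with a toy calculation. The sketch below is not the paper's method: it assumes a PPO-style clipped surrogate, a length-normalized sequence-level ratio (one common choice), and the same clip range `eps` for both granularities. The function names and the log-probabilities are hypothetical.

```python
import math


def token_ratios(new_logps, old_logps):
    """Per-token importance ratios pi_new(token) / pi_old(token)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence ratio: geometric mean of the token ratios.

    Assumption: this normalization is one common choice, not necessarily
    the one used in the paper.
    """
    total = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(total / len(new_logps))


def is_clipped(ratio, advantage, eps):
    """True when the clipped surrogate min(r*A, clip(r, 1-eps, 1+eps)*A)
    has zero gradient with respect to the ratio."""
    if advantage >= 0:
        return ratio > 1 + eps
    return ratio < 1 - eps


# Hypothetical log-probabilities for a 4-token response with positive advantage.
old_logps = [-1.0, -2.0, -0.5, -1.5]
new_logps = [-0.9, -1.2, -0.5, -1.6]
advantage, eps = 1.0, 0.2

# Token level: each token is clipped (or not) on its own.
token_mask = [is_clipped(r, advantage, eps)
              for r in token_ratios(new_logps, old_logps)]

# Sequence level: a single decision applies to every token in the response.
seq_clipped = is_clipped(sequence_ratio(new_logps, old_logps), advantage, eps)

print("tokens clipped at token level:", sum(token_mask), "of", len(token_mask))
print("whole sequence clipped at sequence level:", seq_clipped)
```

In this toy case a single token with a large ratio pushes the sequence-level ratio past the clip range, so all four tokens lose their gradient, while token-level clipping drops only that one token. This is the kind of sample loss the abstract calls gradient underutilization. The token-level variant keeps more gradient signal but exposes the update to the variance of individual ratios.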

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Natural Language Processing Techniques