ESPO: Entropy Importance Sampling Policy Optimization
Yuepeng Sheng, Yuwei Huang, Shuman Liu, Anxiang Zeng, Haibo Zhang

TL;DR
ESPO introduces a novel reinforcement learning framework that combines entropy-based importance sampling and adaptive clipping to improve training stability and efficiency for large language models on complex reasoning tasks.
Contribution
It proposes a new method that decomposes sequences by entropy to enhance gradient utilization and stability in RL training of language models.
Findings
Accelerates convergence in mathematical reasoning benchmarks.
Achieves state-of-the-art performance on complex reasoning tasks.
Improves training stability and efficiency through entropy-based techniques.
Abstract
Reinforcement learning (RL) has become a central component of post-training for large language models (LLMs), particularly for complex reasoning tasks that require stable optimization over long generation horizons. However, achieving performance at scale often introduces a fundamental trade-off between training stability and training efficiency. Token-level optimization applies fine-grained updates to individual tokens, but is prone to high variance in gradient estimation, which can result in unstable training dynamics. In contrast, sequence-level optimization often relies on aggressive clipping mechanisms to ensure stable updates. However, such a design may discard a large fraction of valid training samples, leading to inefficient gradient utilization and reduced training efficiency. We refer to this phenomenon as gradient underutilization. In this work, we propose Entropy Importance…
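The gradient underutilization described above can be illustrated with a toy calculation. The sketch below is not the ESPO method, which the truncated abstract does not specify. It only contrasts token-level and sequence-level importance ratios under a symmetric clip range. The length-normalized sequence ratio, the function names, the clip width `eps`, and the log-probabilities are all assumptions chosen for illustration, and the clip test ignores the sign of the advantage for simplicity.

```python
import math

def token_ratios(new_logps, old_logps):
    # Per-token importance ratios pi_new / pi_old.
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_ratio(new_logps, old_logps):
    # Assumed form: length-normalized (geometric-mean) sequence ratio.
    diff = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(diff / len(new_logps))

def in_range(ratio, eps):
    # Simplification: any ratio outside [1 - eps, 1 + eps] counts as clipped.
    return 1.0 - eps <= ratio <= 1.0 + eps

def kept_token_fraction(batch, eps, level):
    # Fraction of tokens that still contribute a gradient after clipping.
    kept = total = 0
    for new_logps, old_logps in batch:
        total += len(new_logps)
        if level == "sequence":
            # One out-of-range sequence ratio drops every token in the sequence.
            if in_range(sequence_ratio(new_logps, old_logps), eps):
                kept += len(new_logps)
        else:
            kept += sum(in_range(r, eps) for r in token_ratios(new_logps, old_logps))
    return kept / total

# Hypothetical batch of (new_logps, old_logps) pairs; values are made up.
batch = [
    ([-1.0, -1.0, -1.0, -1.0], [-1.0, -1.0, -1.0, -1.0]),  # unchanged policy
    ([-1.0, -1.0, -1.0, -0.2], [-1.0, -1.0, -1.0, -1.0]),  # one token shifted
]

seq_kept = kept_token_fraction(batch, eps=0.2, level="sequence")
tok_kept = kept_token_fraction(batch, eps=0.2, level="token")
print(seq_kept, tok_kept)
```

In this toy batch a single shifted token pushes the whole second sequence outside the clip range, so sequence-level clipping keeps half of the tokens while token-level clipping keeps seven of eight.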
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Natural Language Processing Techniques
