Entropy-Gated Selective Policy Optimization: Token-Level Gradient Allocation for Hybrid Training of Large Language Models
Yuelin Hu, Zhengxue Cheng, Wei Liu, Li Song

TL;DR
The paper introduces EGSPO, a token-level gradient modulation method for hybrid training of large language models that improves scores on reasoning benchmarks with minimal extra computation.
Contribution
It proposes an entropy-gated gradient allocation mechanism intended to encourage exploration and preserve existing knowledge during hybrid training of large language models.
Findings
Improves AIME scores by 3.8% over baseline.
Enhances MATH benchmark performance by 2.9%.
Adds only 3.4% computational overhead.
Abstract
Hybrid training methods for large language models combine supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level. We propose Entropy-Gated Selective Policy Optimization (EGSPO), a three-stage framework that extends sample-level mixing with token-level gradient modulation. Stage 1, SFT expert learning, establishes a reliable warm-up policy using expert demonstrations with a pure SFT loss. Stage 2, RL rollout generation, samples trajectories from the current policy and computes per-token predictive entropy. Stage 3, the EGSPO mechanism, applies entropy-gated gradient allocation: a predictive entropy module routes high-entropy tokens to full PPO updates to encourage exploration, and low-entropy tokens to attenuated PPO updates to reduce variance and preserve knowledge. Critically, both branches incorporate…
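The Stage 3 routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hard threshold, the attenuation factor, the function names, and the default values are all assumptions made for illustration, and the part of the method cut off by the truncated abstract is not modeled. The sketch computes per-token predictive entropy from logits, assigns weight 1.0 to high-entropy tokens and a smaller weight to low-entropy tokens, and scales each token's clipped PPO surrogate term by that weight. Scaling a token's loss term scales its gradient by the same factor.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution for one token."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single token."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

def entropy_gate(entropies, threshold, attenuation):
    """Hypothetical hard gate: full weight at or above the threshold,
    attenuated weight below it. Both parameters are assumptions."""
    return [1.0 if h >= threshold else attenuation for h in entropies]

def gated_ppo_loss(logits_per_token, ratios, advantages,
                   threshold=1.0, attenuation=0.1, clip_eps=0.2):
    """Mean of entropy-weighted PPO terms, negated so it can be minimized."""
    entropies = [token_entropy(l) for l in logits_per_token]
    weights = entropy_gate(entropies, threshold, attenuation)
    terms = [w * ppo_clipped_term(r, a, clip_eps)
             for w, r, a in zip(weights, ratios, advantages)]
    return -sum(terms) / len(terms), weights

# Toy example: one uncertain token (uniform logits) and one confident token.
logits = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
loss, weights = gated_ppo_loss(logits, ratios=[1.0, 1.0], advantages=[1.0, 1.0])
print(weights)  # the uniform token gets full weight, the peaked token is attenuated
```

In an actual training loop the same weights would multiply per-token loss tensors in an autodiff framework; a smooth gate (for example, a sigmoid of the entropy) would be another plausible choice, but the abstract does not specify which form is used.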
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
