TL;DR
UEC-RL (Unified Entropy Control for Reinforcement Learning) is a framework that pairs targeted exploration on difficult prompts with an entropy stabilizer. It aims to avoid the entropy collapse seen with GRPO while keeping training stable, and reports a 37.9% relative improvement over GRPO on Geometry3K.
Contribution
UEC-RL combines two targeted mechanisms for RL training of large language and vision-language models: more exploration on difficult prompts, and a stabilizer that prevents entropy from growing uncontrollably.
Findings
UEC-RL achieves a 37.9% relative improvement over GRPO on Geometry3K.
Experiments show UEC-RL improves Pass@1 and Pass@$k$ metrics.
UEC-RL maintains stable training while expanding exploration.
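For readers unfamiliar with the metrics named above, here is a minimal sketch of how Pass@$k$ and a relative improvement are commonly computed. It assumes the standard unbiased Pass@$k$ estimator from the code-generation literature; the summary does not state which estimator the paper uses. The function names and the numbers in the usage lines are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: n samples drawn per problem, c of them correct.

    Computes 1 - C(n - c, k) / C(n, k), the probability that at least one of
    k samples drawn without replacement from the n is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def relative_improvement(baseline: float, new: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Illustrative numbers only (not results from the paper).
p1 = pass_at_k(n=10, c=3, k=1)           # Pass@1 with 3 of 10 samples correct
gain = relative_improvement(0.40, 0.50)  # a 0.40 -> 0.50 change in a score
```

Note that a relative improvement is measured against the baseline score, so it is not the same as a gain in absolute percentage points.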
Abstract
Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to maintain optimization stability. We propose Unified Entropy Control for Reinforcement Learning (UEC-RL), a framework that provides targeted mechanisms for exploration and stabilization. UEC-RL activates more exploration on difficult prompts to search for potential and valuable reasoning trajectories. In parallel, a stabilizer prevents entropy from growing uncontrollably, thereby keeping training stable as the model consolidates reliable behaviors.…
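To make the terms in the abstract concrete, the sketch below illustrates two background quantities: the group-relative advantage used by GRPO and the Shannon entropy of a next-token distribution, whose decline during training is what "entropy collapse" refers to. This is not the UEC-RL method, which the truncated abstract does not specify. The function names, the use of population standard deviation, and the `eps` constant are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its sampled group.

    `rewards` holds the scalar rewards of all responses sampled for a single
    prompt. Population std and `eps` are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A group with mixed outcomes yields non-zero advantages ...
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# ... while a group where every response gets the same reward yields none.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A peaked distribution has much lower entropy than a uniform one.
h_uniform = token_entropy([0.25, 0.25, 0.25, 0.25])
h_peaked = token_entropy([0.97, 0.01, 0.01, 0.01])
```

In this sketch, a group whose responses all receive the same reward produces zero advantages, so that prompt contributes no learning signal for the step. The peaked distribution shows what a low-entropy, low-diversity policy looks like at the token level.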
