TL;DR
This paper introduces Entrocraft, a rejection-sampling method that controls entropy in RL for LLMs, preventing performance saturation and improving generalization and diversity.
Contribution
Entrocraft provides a simple, regularization-free approach to precisely schedule entropy, enabling sustained RL training improvements in large language models.
Findings
Entrocraft outperforms baseline methods on generalization and output diversity.
Linear entropy annealing yields the best performance.
Performance keeps improving for longer during RL training before it plateaus.
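To make "linear entropy annealing" concrete, here is a minimal sketch of a linear target-entropy schedule. The function name, its parameters, and the start and end values are hypothetical illustrations; this summary does not give the paper's actual schedule parameters.

```python
def linear_entropy_target(step, total_steps, start_entropy, end_entropy):
    """Linearly interpolate a target entropy from start_entropy to end_entropy.

    All names and values here are illustrative assumptions, not taken from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps outside the schedule hold the boundary value.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_entropy + frac * (end_entropy - start_entropy)


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, round(linear_entropy_target(step, 1000, 1.0, 0.2), 3))
```

A schedule like this only supplies the target value at each training step; how the policy's entropy is steered toward that target is a separate mechanism.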
Abstract
Reinforcement learning (RL) has enabled complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing continued gains as RL training scales. This problem can be characterized by the collapse of entropy, a key diagnostic for exploration in RL. Existing attempts focus on preventing entropy collapse through regularization or clipping. However, the resulting entropy curves often exhibit long-term instability, which hinders performance gains. In this paper, we introduce Entrocraft, a simple rejection-sampling approach that realizes a user-customized entropy schedule by biasing the advantage distribution. Entrocraft requires no objective regularization and is advantage-estimator-agnostic. Theoretically, we relate per-step entropy change to the advantage distribution under minimal assumptions. This explains the…
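The general idea of steering entropy by rejecting samples according to the sign of their advantage can be sketched as follows. This is not the paper's algorithm: the function name, the `reject_prob` parameter, and the rejection rule are hypothetical. The sketch also assumes a common heuristic, namely that updates on positive-advantage samples tend to lower policy entropy while updates on negative-advantage samples tend to raise it.

```python
import random


def filter_by_entropy_target(advantages, current_entropy, target_entropy,
                             reject_prob=0.5, rng=None):
    """Return the indices of samples kept for the policy update.

    Hypothetical rule (an assumption, not the paper's method):
    - entropy below target: randomly drop positive-advantage samples;
    - entropy above target: randomly drop negative-advantage samples;
    - entropy on target: keep everything.
    """
    rng = rng or random.Random()
    if current_entropy < target_entropy:
        def should_thin(a):
            return a > 0
    elif current_entropy > target_entropy:
        def should_thin(a):
            return a < 0
    else:
        return list(range(len(advantages)))

    kept = []
    for i, a in enumerate(advantages):
        # Reject a thinned-class sample with probability reject_prob.
        if should_thin(a) and rng.random() < reject_prob:
            continue
        kept.append(i)
    return kept


if __name__ == "__main__":
    advantages = [1.0, -1.0, 0.5, -0.2]
    kept = filter_by_entropy_target(advantages, current_entropy=0.1,
                                    target_entropy=0.5, reject_prob=1.0)
    print(kept)
```

Because the filter acts only on which samples reach the update, a scheme of this shape adds no term to the training objective and works with any advantage estimator, which matches the properties the abstract claims for Entrocraft.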
