TL;DR
Learning-Zone Energy (LZE) is an online data selection method for reinforcement learning in large language models that efficiently focuses training on the most informative prompts, reducing compute while maintaining or improving performance.
Contribution
LZE introduces a theoretically grounded, fully online scoring framework that dynamically concentrates training on the model's active learning frontier, improving efficiency over uniform sampling methods.
Findings
Retains only 40% of training data per step while matching or surpassing full-data baselines.
Achieves +45.9% out-of-distribution gains on AIME25 and +18.2% on AMC23.
Reduces training FLOPs by an estimated 36%.
Abstract
Reinforcement Learning (RL) post-training has emerged as the dominant paradigm for eliciting mathematical reasoning in Large Language Models (LLMs), yet prevailing techniques such as GRPO and DAPO distribute rollout and gradient budgets nearly uniformly across prompts, squandering compute on samples that are already mastered or remain far beyond the model's current capability. To address this fundamental inefficiency, we propose Learning-Zone Energy (LZE), a theoretically grounded, fully online data selection framework that concentrates computation on the model's active learning frontier. At its core, we define a closed-form Learning-Zone Energy Score that fuses three complementary signals, an initial-difficulty anchor, a normalized outcome-uncertainty term, and a pass-rate momentum, into a single scalar that is provably aligned with the expected magnitude of group-relative policy…
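The abstract names three signals but the closed-form score itself is cut off here, so the following is only a rough sketch of the general idea, not the paper's formula. It assumes binary rewards, uses the normalized Bernoulli variance 4p(1-p) as the uncertainty term (zero when a prompt is always solved or never solved, which is when group-relative advantages vanish), and combines it with a hypothetical difficulty anchor and momentum term using made-up weights. The function names, weights, and combination rule are all assumptions for illustration.

```python
import math
from typing import Dict, List


def outcome_uncertainty(pass_rate: float) -> float:
    """Normalized Bernoulli variance 4p(1-p): 0 at p in {0, 1}, 1 at p = 0.5."""
    return 4.0 * pass_rate * (1.0 - pass_rate)


def illustrative_score(
    initial_pass_rate: float,
    pass_rate: float,
    prev_pass_rate: float,
    w_anchor: float = 0.2,       # hypothetical weight, not from the paper
    w_uncertainty: float = 0.6,  # hypothetical weight, not from the paper
    w_momentum: float = 0.2,     # hypothetical weight, not from the paper
) -> float:
    """Toy weighted sum of three signals; NOT the paper's closed-form score."""
    anchor = 1.0 - initial_pass_rate            # harder at the start -> larger anchor
    uncertainty = outcome_uncertainty(pass_rate)
    momentum = abs(pass_rate - prev_pass_rate)  # how fast the pass rate is moving
    return w_anchor * anchor + w_uncertainty * uncertainty + w_momentum * momentum


def select_top_fraction(scores: Dict[str, float], keep_fraction: float = 0.4) -> List[str]:
    """Keep the highest-scoring fraction of prompts (at least one)."""
    k = max(1, math.ceil(keep_fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical per-prompt statistics: (initial, current, previous) pass rates.
stats = {
    "mastered": (0.9, 1.0, 1.0),
    "hopeless": (0.0, 0.0, 0.0),
    "frontier": (0.2, 0.5, 0.375),
    "improving": (0.1, 0.25, 0.125),
    "easyish": (0.6, 0.875, 0.875),
}
scores = {name: illustrative_score(*s) for name, s in stats.items()}
selected = select_top_fraction(scores, keep_fraction=0.4)
```

With these toy numbers, keeping 40% of five prompts retains the two whose pass rates sit nearest the middle of the range, while the always-solved and never-solved prompts are dropped. How the paper actually weights and normalizes the three terms is not recoverable from the truncated abstract.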
