Prioritized Replay for RL Post-training
Mehdi Fatemi

TL;DR
This paper presents a model-driven prioritization framework for RL post-training that dynamically selects problems based on success statistics, improving learning efficiency without predefined difficulty tiers.
Contribution
It introduces a novel, automatic prioritization method for RL post-training that adapts to problem success rates without relying on external labels or predefined curricula.
Findings
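Prioritizing problems by their empirical success statistics yields stronger learning signals.
Automatic scheduling focuses training on problems with intermediate success rates, those neither consistently solved nor consistently failed.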
Practical deployment mechanisms enhance scalability.
Abstract
We introduce a problem-level prioritization framework for RL post-training of large language models. Building on insights from prioritized replay in deep RL, as well as prior observations that rollouts with intermediate success rates tend to produce stronger learning signals under methods such as GRPO, our approach selects problems according to a simple, model-driven priority score derived from empirical success statistics. In contrast to conventional curriculum strategies that emphasize easier tasks early in training, the resulting schedule naturally focuses training on problems that are neither consistently solved nor consistently failed, while deprioritizing those that contribute little gradient information. The method yields a continuously adapting and automatic prioritization process that requires no predefined difficulty tiers, auxiliary predictors, or external labels. We further…
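As a rough illustration of the idea in the abstract, the sketch below scores each problem from its empirical success rate and samples training problems in proportion to that score. The abstract excerpt here does not state the exact formula, so the score p * (1 - p) is an assumption chosen only because it peaks at intermediate success rates and vanishes for problems that are always solved or always failed (cases in which group-normalized advantages, as in GRPO, are all zero). The names ProblemStats, priority, and sample_batch, the 0.5 default for unseen problems, and the small exploration floor are likewise hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class ProblemStats:
    """Running success counts for one problem (hypothetical container)."""
    successes: int = 0
    attempts: int = 0

    def success_rate(self) -> float:
        # Assumption: unseen problems default to 0.5 so they are tried early.
        if self.attempts == 0:
            return 0.5
        return self.successes / self.attempts


def priority(stats: ProblemStats) -> float:
    """Assumed score p * (1 - p): largest at p = 0.5, zero at p = 0 or p = 1."""
    p = stats.success_rate()
    return p * (1.0 - p)


def sample_batch(stats_by_id, k, rng, floor=1e-3):
    """Sample k problem ids (with replacement), weighted by priority.

    The floor is an assumption: it keeps always-solved and always-failed
    problems reachable so their statistics can still be refreshed.
    """
    ids = list(stats_by_id)
    weights = [priority(stats_by_id[i]) + floor for i in ids]
    return rng.choices(ids, weights=weights, k=k)


if __name__ == "__main__":
    stats = {
        "always_solved": ProblemStats(successes=8, attempts=8),
        "always_failed": ProblemStats(successes=0, attempts=8),
        "intermediate": ProblemStats(successes=4, attempts=8),
    }
    rng = random.Random(0)
    print(sample_batch(stats, k=8, rng=rng))
```

After each round of rollouts, the counts for the sampled problems would be updated and the priorities recomputed, so the schedule adapts as the model improves without any predefined difficulty tiers.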
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
