Prioritized Replay for RL Post-training

Mehdi Fatemi

arXiv:2601.02648·cs.LG·January 7, 2026

TL;DR

This paper presents a model-driven prioritization framework for RL post-training that dynamically selects problems based on success statistics, improving learning efficiency without predefined difficulty tiers.

Contribution

It introduces a novel, automatic prioritization method for RL post-training that adapts to problem success rates without relying on external labels or predefined curricula.

Findings

01. Effective problem prioritization improves learning signals.

02. Automatic scheduling focuses on problems with intermediate success.

03. Practical deployment mechanisms enhance scalability.

Abstract

We introduce a problem-level prioritization framework for RL post-training of large language models. Building on insights from prioritized replay in deep RL, as well as prior observations that rollouts with intermediate success rates tend to produce stronger learning signals under methods such as GRPO, our approach selects problems according to a simple, model-driven priority score derived from empirical success statistics. In contrast to conventional curriculum strategies that emphasize easier tasks early in training, the resulting schedule naturally focuses training on problems that are neither consistently solved nor consistently failed, while deprioritizing those that contribute little gradient information. The method yields a continuously adapting and automatic prioritization process that requires no predefined difficulty tiers, auxiliary predictors, or external labels. We further…
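
The abstract does not state the priority score itself, so the following is only a minimal sketch under stated assumptions. It assumes a score of the form p(1 - p), where p is a problem's empirical success rate; this is one simple function that vanishes for problems that are always solved or always failed and peaks at intermediate success. The names `priority`, `sample_problems`, the `eps` floor, and the treatment of unseen problems are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def priority(successes: int, attempts: int) -> float:
    """Hypothetical score p * (1 - p) computed from empirical success counts.

    The score is 0 when a problem is always solved or always failed and
    reaches its maximum of 0.25 at a success rate of 0.5.
    """
    if attempts == 0:
        # Assumption: unseen problems get the maximum score so they are tried.
        return 0.25
    p = successes / attempts
    return p * (1.0 - p)


def sample_problems(stats, k, rng, eps=1e-3):
    """Draw k problem ids with probability proportional to priority + eps.

    `stats` maps a problem id to a (successes, attempts) pair. The small
    floor `eps` (an assumption) keeps zero-priority problems reachable.
    """
    ids = list(stats)
    weights = [priority(*stats[i]) + eps for i in ids]
    return rng.choices(ids, weights=weights, k=k)


# Toy success statistics: (successes, attempts) per problem.
stats = {"easy": (8, 8), "medium": (4, 8), "hard": (0, 8), "new": (0, 0)}
rng = random.Random(0)
counts = Counter(sample_problems(stats, k=1000, rng=rng))
print(counts)
```

In this toy setup the draws concentrate on "medium" and "new", while "easy" and "hard" are selected only rarely, which mirrors the schedule described in the abstract. In a training loop, the success counts would be updated after each batch of rollouts so that the priorities keep adapting.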

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning