Prompt Replay: Speeding Up GRPO with On-Policy Reuse of High-Signal Prompts
Andrei Baroian, Rutger Berger

TL;DR
Prompt Replay is an efficient online data selection method for GRPO that reuses prompts, not trajectories, to accelerate learning, especially on difficult datasets. It prioritizes medium-difficulty prompts to maximize the learning signal.
Contribution
The paper introduces Prompt Replay, a prompt reuse strategy that improves the efficiency and effectiveness of GRPO training by selectively reusing prompts based on their difficulty, as measured by pass rate.
Findings
Reduces the number of zero-variance prompts and increases the advantage.
Accelerates initial accuracy gains across multiple models.
Converges to a plateau similar to that of baseline methods.
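The link between zero-variance prompts, pass rate, and advantage can be illustrated with a minimal sketch. It assumes binary rewards and the commonly used group-normalized advantage (reward minus group mean, divided by group standard deviation). The function names and the `eps` term are illustrative and are not taken from the paper.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: this is the usual GRPO-style normalization; the paper's
    exact formula is not shown in this summary.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def mean_abs_advantage(rewards):
    """Average magnitude of the advantage across one prompt's rollouts."""
    adv = group_advantages(rewards)
    return sum(abs(a) for a in adv) / len(adv)

# Eight rollouts per prompt, reward 1.0 for a correct answer, 0.0 otherwise.
groups = {
    "all wrong (pass rate 0.0)": [0.0] * 8,
    "half right (pass rate 0.5)": [1.0] * 4 + [0.0] * 4,
    "mostly right (pass rate 0.875)": [1.0] * 7 + [0.0],
}
for name, rewards in groups.items():
    print(name, round(mean_abs_advantage(rewards), 3))
```

Under these assumptions, a prompt whose rollouts all receive the same reward yields zero advantage for every rollout and therefore no gradient signal, while the mean absolute advantage is largest when half the rollouts are correct.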
Abstract
Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the reasoning capabilities of LLMs, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses prompts only (not trajectories) to preserve on-policy optimization. After each step, we insert prompts of medium difficulty into a buffer and prioritize prompts closer to a pass rate of 0.5 (half of the answers correct, half wrong) to maximize the advantage and thus the learning signal. Training batches are formed by mixing reused prompts with fresh samples, with cooldown steps and a maximum reuse count controlling aggressiveness versus the risk of overfitting. Across multiple model families (Llama-3.2-3B, Qwen3-8B) and training datasets (Dolci, Polaris), evaluated using average accuracy on six…
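The buffer mechanics described in the abstract can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the class and function names, the 0.25 to 0.75 window used for "medium difficulty", the exact cooldown semantics, and the `replay_fraction` mixing parameter are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    pass_rate: float      # fraction of correct rollouts at the last visit
    last_step: int        # training step at which the prompt was last used
    reuse_count: int = 0  # how many times the prompt has been replayed

class PromptReplayBuffer:
    """Hypothetical buffer that stores prompt ids only, never trajectories."""

    def __init__(self, low=0.25, high=0.75, cooldown_steps=2, max_reuse=3):
        self.low, self.high = low, high       # assumed "medium difficulty" window
        self.cooldown_steps = cooldown_steps  # steps to wait before replaying
        self.max_reuse = max_reuse            # cap on replays per prompt
        self.entries = {}                     # prompt_id -> Entry

    def update(self, step, pass_rates):
        """Record pass rates observed at `step`; keep medium-difficulty prompts."""
        for pid, rate in pass_rates.items():
            prev = self.entries.get(pid)
            count = prev.reuse_count if prev else 0
            if self.low <= rate <= self.high and count < self.max_reuse:
                self.entries[pid] = Entry(rate, step, count)
            else:
                self.entries.pop(pid, None)  # too easy, too hard, or exhausted

    def sample(self, step, k):
        """Return up to k eligible prompt ids, closest to pass rate 0.5 first."""
        eligible = [
            pid for pid, e in self.entries.items()
            if step - e.last_step >= self.cooldown_steps
            and e.reuse_count < self.max_reuse
        ]
        eligible.sort(key=lambda pid: abs(self.entries[pid].pass_rate - 0.5))
        chosen = eligible[:k]
        for pid in chosen:
            self.entries[pid].reuse_count += 1
        return chosen

def build_batch(buffer, fresh_ids, step, batch_size, replay_fraction=0.5):
    """Mix replayed prompts with fresh ones; fresh prompts fill any shortfall."""
    reused = buffer.sample(step, int(batch_size * replay_fraction))
    return reused + fresh_ids[: batch_size - len(reused)]
```

In a training loop, one would call `build_batch` before generating rollouts and `update` afterwards with the freshly measured pass rates. Because every reused prompt gets new rollouts from the current policy, optimization stays on-policy. A larger `replay_fraction` or `max_reuse`, or a shorter cooldown, makes reuse more aggressive at a higher risk of overfitting.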
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
