EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework
Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yuzhi Zhang, Yue Wang

TL;DR
EFRame is a reinforcement learning framework that enhances reasoning in large language models by combining exploration, filtering, and replay techniques to improve training stability and performance on complex tasks.
Contribution
It introduces a novel Exploration-Filter-Replay framework that significantly improves the efficiency and stability of policy optimization for reasoning tasks in LLMs.
Findings
Achieves a 37.9% relative improvement on the Geometry3K benchmark.
Supports fine-grained sample categorization and entropy control.
Demonstrates consistent gains across diverse reasoning benchmarks.
Abstract
Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), a lightweight variant of Proximal Policy Optimization (PPO), improves efficiency but suffers from limited exploration and training instability, limiting its effectiveness on complex reasoning tasks. To address these challenges, we introduce EFRame, an Exploration-Filter-Replay framework that augments GRPO across three dimensions: additional rollouts enable deeper and more targeted exploration, online filtering removes low-quality samples to stabilize gradients and accelerate training, and experience replay amplifies rare yet informative trajectories for stable convergence. This unified framework establishes a principled training cycle that balances exploration, efficiency, and stability. Experiments on diverse…
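The three stages named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names (`group_advantages`, `efr_step`), the trigger for extra rollouts (no rollout in the group earned a reward), the filter rule (keep only positive-advantage samples), the replay policy, and all default values are assumptions made for this example. Only the group-relative advantage (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import random
import statistics
from collections import deque


def group_advantages(rewards):
    """GRPO-style advantage: normalize each reward within its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rewards are equal, so no sample carries a learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def efr_step(prompt, sample_fn, reward_fn, replay,
             group_size=4, extra_rollouts=4, explore_temp=1.2, replay_k=2):
    """One hypothetical Exploration-Filter-Replay step for a single prompt.

    sample_fn(prompt, temperature) -> response and
    reward_fn(prompt, response) -> float are caller-supplied stand-ins
    for the policy model and the reward function.
    """
    # Regular rollouts at the default temperature.
    responses = [sample_fn(prompt, 1.0) for _ in range(group_size)]
    rewards = [reward_fn(prompt, r) for r in responses]

    # Exploration (assumed trigger): if nothing succeeded, sample more
    # rollouts at a higher temperature.
    explored = max(rewards) <= 0.0
    if explored:
        extra = [sample_fn(prompt, explore_temp) for _ in range(extra_rollouts)]
        responses += extra
        rewards += [reward_fn(prompt, r) for r in extra]

    advantages = group_advantages(rewards)

    # Filter (assumed rule): keep only samples with positive advantage.
    batch = [(prompt, r, a) for r, a in zip(responses, advantages) if a > 0.0]

    # Replay (assumed policy): draw a few stored samples first, then store
    # the successes that were only found through exploration.
    replayed = random.sample(list(replay), min(replay_k, len(replay)))
    if explored:
        replay.extend(batch)

    return batch, replayed
```

A caller would pass a bounded buffer such as `deque(maxlen=1024)` as `replay` and feed both returned lists into the policy update. The bounded deque is one simple way to keep stored trajectories from growing without limit; the source does not say how the authors manage their buffer.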
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
- This article focuses on an important issue, the significance of exploration for RL.
- The idea is very simple and extensible to prior methods.
- The experimental design is relatively weak, with too few baselines: aside from the fundamental algorithm GRPO, the comparison includes only one method of the same type (DAPO). The experimental analysis is also insufficient.
- The paper does not verify scalability, such as testing across different model architectures or sizes.
- All chosen benchmarks are standard math tasks, without any out-of-distribution (OOD) tasks to demonstrate the effectiveness of exploration.
1. The authors provide a recipe for stable RL training that includes additional rollouts with higher temperature, online filtering, and experience replay. I believe it's a promising research direction.
2. This paper provides a detailed analysis of each introduced mechanism based on the current challenges of GRPO, which is well motivated and reasonable.
3. This paper is well organized and easy to follow.
I discuss the weaknesses in originality and experiments. Weaknesses marked with **W** are key concerns that might affect the final rating, while weaknesses marked with **M** may have a minor impact on my rating.
### Originality
**[M1]** The core ideas used in this work, *i.e.*, adaptive sampling for hard problems [1][2], online filtering [2][3], and experience replay [4][5], have been explored in prior literature. This work combines these existing ideas well, but it's not very inspiring to me.
### Ex
1. The proposed Exploration–Filter–Replay framework is conceptually clear and easy to follow.
2. The method improves training stability and reasoning accuracy compared to GRPO baselines.
3. The ablation experiments provide useful insight into the contribution of each component.
1. Limited novelty: Similar mechanisms have already been explored in RLEP [1], RePO [2] and VL-Rethinker [3], which all employ replay-based or filtering strategies to stabilize reinforcement learning for reasoning tasks.
2. Baseline insufficiency: The paper does not compare against these closely related works [1–3], making it unclear how much gain is attributable to EFRame itself.
3. Lack of exploration metrics: The claimed improvement in exploration is not supported by pass@k, a standard eval
- The paper is well written and easy to follow.
- The experiments are conducted on three diverse datasets and the gains are strong.
- The framework has three distinct parts, and the authors ablate them by isolating the effect of each component.
- While well-engineered, the framework primarily combines known components (resampling, filtering, replay buffer) on top of the existing GRPO framework rather than introducing a fundamentally new optimization principle.
- The paper lacks theoretical justifications, and some claims are poorly supported:
  - In lines 243-248, "low-quality samples are significantly more numerous than high-quality ones, ... the informative signal from high-quality samples may be drowned out by chaotic updates fr
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling
