TL;DR
This paper introduces RED, a novel method to improve small language models by balancing exploration and offline data integration, addressing exploration limitations and data distribution issues.
Contribution
RED proposes controlled exploration and refined offline integration techniques to enhance reasoning in small language models, a less explored area compared to large models.
Findings
Improved reasoning capabilities in small models.
Effective regulation of offline and online data balance.
Dynamic policy shift enhances learning from offline data.
Abstract
Many existing studies have achieved significant improvements in the reasoning capabilities of large language models (LLMs) through reinforcement learning with verifiable rewards (RLVR), while the enhancement of reasoning abilities in small language models (SLMs) has not yet been sufficiently explored. Combining distilled data from larger models with RLVR on small models themselves is a natural approach, but it still faces various challenges and issues. Therefore, we propose Recall-Extend Dynamics (RED): Enhancing Small Language Models through Controlled Exploration and Refined Offline Integration. In this paper, we explore the perspective of varying exploration spaces, balancing offline distillation with online reinforcement learning. Simultaneously, we specifically design and optimize for the insertion problem within offline…
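The abstract describes balancing offline distillation against online reinforcement learning, and the reviews on this page describe the method as a GRPO objective combined with an SFT objective, with entropy used to balance the two. The sketch below illustrates that general idea only. It is not the paper's formulation: the weighting rule in `offline_weight`, the function names, and the toy numbers are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def offline_weight(entropy, max_entropy):
    """Hypothetical schedule (an assumption, not RED's actual rule):
    weight the offline (SFT) term by normalized policy entropy, clipped to [0, 1]."""
    if max_entropy <= 0.0:
        return 0.0
    return min(1.0, max(0.0, entropy / max_entropy))


def combined_loss(rl_loss, sft_loss, weight):
    """Online RL loss plus a weighted offline SFT loss."""
    return rl_loss + weight * sft_loss


# Toy usage with made-up numbers.
probs = [0.7, 0.2, 0.1]                      # one next-token distribution
w = offline_weight(token_entropy(probs), math.log(len(probs)))
adv = group_advantages([1.0, 0.0, 0.0, 1.0])  # verifiable 0/1 rewards for 4 rollouts
loss = combined_loss(rl_loss=0.8, sft_loss=1.5, weight=w)
```

The direction of the schedule (more offline weight when entropy is high) is a design choice made here for concreteness. The paper may regulate the balance differently.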
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The idea of using entropy to balance different objectives is intuitive and easy to implement.
2. The model is compared with competitive baselines, including unified and stage-wise frameworks.
1. Optimizing SFT and RL objectives concurrently can induce conflicting gradient signals and non-stationary targets, especially when the SFT teacher distribution and the RL policy are far from each other. This can exacerbate training instability that RL-based approaches face.
2. The method is framed as targeted at small language models, but the design and analysis appear model-size agnostic. The paper does not isolate challenges unique to SLM fine-tuning (e.g., weaker in-context learning, shor
- The design of $\pi^{offline}$ is novel to the reviewer, and Figure 6 looks good.
- The results are evaluated on several benchmarks, though only on one base model.
The general algorithm is not novel: it combines the GRPO objective and the SFT objective with minimal modifications. However, the intuition behind the modifications is not well explained. Please see the "Questions" section. The main experiments are conducted only on the Qwen2.5-math-1.5B model, making the empirical improvement less convincing. The theoretical part is highly informal, for the following reasons: 1. The proof of "Entropy Preservation" property in Proposition 1 is not ri
1. The proposed Recall-Extend dichotomy is conceptually elegant and provides an intuitive lens for understanding SFT–RL synergy in small models.
2. The proposed entropy-based regulation and accuracy-aware policy offset are both theoretically analyzed with proofs of stability.
3. RED achieves consistent gains in reasoning accuracy and efficiency across multiple datasets.
1. The paper’s core contributions, Dynamic Entropy Regulation and Accuracy-Aware Policy Shift, represent well-motivated but incremental refinements to existing unified SFT–RL frameworks, rather than a fundamentally new paradigm.
2. Figure 1 and Figure 5 report accuracy or entropy trends, but the paper does not specify on which dataset or benchmark these results were obtained.
3. The paper claims that RED improves efficiency but provides no quantitative comparison (e.g., GPU hours or rollout t
The problem of reasoning in small language models is important and well-motivated. Moreover, the authors propose a practical framework which comes with some nice built-in properties.
With that being said, the authors’ contribution is somewhat incremental when compared to previous work in this area. Additionally, the empirical results are also somewhat incremental/inconclusive as to whether (and by how much) this method improves over baselines. The theoretical results could also be proven more rigorously. For example, the “proof” of Proposition B.3 reads more like a proof sketch than an actual proof. Finally, the paper was hard to follow at times (e.g. many long equations a
