TL;DR
This paper introduces DGPO, a method to enable small language models to perform agentic search and planning behaviors effectively, overcoming training challenges through teacher guidance.
Contribution
It presents DGPO, a novel distillation-guided policy optimization technique that enhances agentic capabilities in compact language models, along with ARCap, a new metric for fine-grained analysis.
Findings
DGPO enables small models to perform sophisticated agentic search behaviors.
Compact models with DGPO can outperform larger teachers in some agentic tasks.
The approach makes agentic RAG feasible in resource-constrained settings.
| Question: Whose album was Red? |
|---|
| <think> To answer this question, I need to find out which artist released the album titled "Red". </think> |
| <search> Red album artist </search> |
| <information> Doc 1(Title: "Red (Taylor Swift album)") as Jennifer Lopez, Ed Sheeran (who also served as an … |
| Doc 2(Title: "Red (band)") deal with Christian label Essential Records in 2004. Their debut albums … |
| Doc 3(Title: "Red (Dia Frampton album)") Red (Dia Frampton album) Red is the debut … </information> |
| <think> From the information provided, it is clear that the album "Red" was released by Taylor Swift. It was her fourth studio album and had a country sound. </think> |
| <answer> Taylor Swift </answer> |
| Qwen 2.5 (3B→0.5B) | NQ | TriviaQA | PopQA | HotpotQA | 2wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| Student-0.5B | 0.004 | 0.006 | 0.007 | 0.007 | 0.015 | 0.000 | 0.000 | 0.006 |
| Teacher-3B | 0.365 | 0.569 | 0.393 | 0.340 | 0.368 | 0.135 | 0.298 | 0.353 |
| PPO Jin et al. (2025) | 0.306 | 0.444 | 0.379 | 0.205 | 0.218 | 0.041 | 0.073 | 0.238 |
| SFT→PPO | 0.338 | 0.415 | 0.359 | 0.296 | 0.275 | 0.088 | 0.250 | 0.289 |
| GKD Agarwal et al. (2024) | 0.266 | 0.408 | 0.358 | 0.216 | 0.217 | 0.055 | 0.161 | 0.240 |
| SeqKD Kim and Rush (2016) | 0.331 | 0.416 | 0.364 | 0.283 | 0.273 | 0.089 | 0.169 | 0.275 |
| KD Hinton et al. (2015) | 0.331 | 0.431 | 0.373 | 0.286 | 0.284 | 0.091 | 0.290 | 0.298 |
| DistiLLM Ko et al. (2024) | 0.333 | 0.442 | 0.373 | 0.288 | 0.270 | 0.095 | 0.209 | 0.287 |
| TAID Shing et al. (2025) | 0.325 | 0.427 | 0.365 | 0.290 | 0.270 | 0.079 | 0.218 | 0.282 |
| DGPO (ours) | 0.378 | 0.481 | 0.402 | 0.342 | 0.303 | 0.120 | 0.274 | 0.329 |
| Model family | Qwen 2.5 | Qwen 2.5 | Llama 3 |
|---|---|---|---|
| Student size | 0.5B | 0.5B | 1B |
| Teacher size | 3B | 7B | 8B |
| Student | 0.006 | 0.006 | 0.039 |
| Teacher | 0.353 | 0.385 | 0.438 |
| PPO | 0.238 | 0.238 | 0.250 |
| KD | 0.298 | 0.280 | 0.347 |
| DGPO | 0.329 | 0.323 | 0.389 |
| Qwen2.5 (3B→0.5B) | NQ (w/o thinking) | NQ (w/ thinking) | MuSiQue (w/o thinking) | MuSiQue (w/ thinking) |
|---|---|---|---|---|
| Student-0.5B | 0.386 | 0.034 | 0.166 | 0.013 |
| Teacher-3B | 0.589 | 0.560 | 0.413 | 0.357 |
| PPO | 0.547 | 0.581 | 0.258 | 0.242 |
| KD | 0.540 | 0.544 | 0.321 | 0.256 |
| DGPO | 0.565 | 0.593 | 0.312 | 0.287 |
| Method | init (KD) | pipeline | KL penalty | NQ | TriviaQA | PopQA | HotpotQA | 2wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DGPO | ✓ | KD→PPO | selective | 0.378 | 0.481 | 0.402 | 0.342 | 0.303 | 0.120 | 0.274 | 0.329 |
| (a) w/o cold-start initialization | – | KD→PPO | selective | 0.370 | 0.465 | 0.394 | 0.330 | 0.299 | 0.117 | 0.266 | 0.320 |
| (b) w/o selective KL penalty | ✓ | KD→PPO | uniform | 0.362 | 0.464 | 0.394 | 0.323 | 0.306 | 0.114 | 0.234 | 0.314 |
| (c) w/o teacher guidance | ✓ | KD→PPO | – | 0.353 | 0.455 | 0.384 | 0.316 | 0.287 | 0.098 | 0.250 | 0.306 |
| (d) invert pipeline order | – | PPO→KD | – | 0.320 | 0.426 | 0.371 | 0.287 | 0.282 | 0.084 | 0.234 | 0.286 |
| System Template for qwen2.5 series |
|---|
| You are Qwen, created by Alibaba Cloud. You are a helpful assistant. |
| Instruction Template |
| Answer the given question. You must conduct reasoning inside <think> and </think> first every time you get new information. After reasoning, if you find you lack some knowledge, you can call a search engine by <search> query </search> , and it will return the top searched results between <information> and </information> . You can search as many times as you want. If you find no further external knowledge needed, you can directly provide the answer inside <answer> and </answer> without detailed illustrations. For example, <answer> xxx </answer> . Question: question. |
| Config | Parameter | Value |
|---|---|---|
| RL parameters | Total training steps | 1000 |
| | Batch size | 512 |
| | KL divergence coefficient | 0.001 |
| | Maximum prompt length | 4096 |
| | Maximum response length | 500 |
| | Maximum conversation turns | 4 |
| | Top-k retrieved documents | 3 |
| | Actor learning rate | 1e-6 |
| | Critic learning rate | 1e-5 |
| KD parameters (initialization) | Total epochs | 5 |
| | Batch size | 64 |
| | Learning rate | 1e-4 |
| | KL divergence ratio | 1.0 |
| DistiLLM Ko et al. (2024) | Skew KLD target weight | 0.1 |
| TAID Shing et al. (2025) | | 0.4 |
| | | 1.0 |
| | Updating interpolation | 5e-4 |
| | Momentum coefficient | 0.99 |
| Dataset | Training samples | Test samples | License |
|---|---|---|---|
| Natural Questions (NQ) (Kwiatkowski et al., 2019) | 79,168 | 3,610 | CC BY-SA 3.0 |
| TriviaQA (Joshi et al., 2017) | – | 11,313 | Apache-2.0 |
| PopQA (Mallen et al., 2023) | – | 14,267 | MIT |
| HotpotQA (Yang et al., 2018) | 90,447 | 7,405 | CC BY-SA 4.0 |
| 2WikiMultiHopQA (Ho et al., 2020) | – | 12,576 | Apache-2.0 |
| MuSiQue (Trivedi et al., 2022) | – | 2,417 | CC BY 4.0 |
| Bamboogle (Press et al., 2023) | – | 125 | MIT |
| Setting | Results | KD (initialization) | PPO Loss | GRPO Loss | Selective KL penalty | Uniform KL penalty |
|---|---|---|---|---|---|---|
| DGPO | Tab. 2 | ✓ | ✓ | ✓ | ||
| w/ GRPO | Fig. 6 | ✓ | ✓ | ✓ | ||
| w/o cold-start initialization | Tab. 6 | ✓ | ✓ | |||
| w/o selective KL penalty (uniform KL penalty) | Tab. 6 | ✓ | ✓ | ✓ | ||
| w/o teacher guidance (KD→PPO) | Tab. 6 | ✓ | ✓ | |||
| invert pipeline order (PPO→KD) | Tab. 6 | ✓ | ✓ | |||
| KD→GKD | Fig. 6 | ✓ | ✓ | |||
| PPO (Jin et al., 2025) | Tab. 2 | ✓ | ||||
| KD (Hinton et al., 2015) | Tab. 2 | ✓ | ||||
| GKD (Agarwal et al., 2024) | Tab. 2 | ✓ |
| Qwen 2.5 (7B→0.5B) | NQ | TriviaQA | PopQA | HotpotQA | 2wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| Student-0.5B | 0.004 | 0.006 | 0.007 | 0.007 | 0.015 | 0.000 | 0.000 | 0.006 |
| Teacher-7B | 0.393 | 0.610 | 0.397 | 0.370 | 0.414 | 0.146 | 0.368 | 0.385 |
| PPO Jin et al. (2025) | 0.306 | 0.444 | 0.379 | 0.205 | 0.218 | 0.041 | 0.073 | 0.238 |
| KD Hinton et al. (2015) | 0.338 | 0.428 | 0.371 | 0.288 | 0.223 | 0.100 | 0.210 | 0.280 |
| DGPO (ours) | 0.371 | 0.474 | 0.396 | 0.334 | 0.257 | 0.113 | 0.315 | 0.323 |
| Llama 3 (8B→1B) | NQ | TriviaQA | PopQA | HotpotQA | 2wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| Student-1B | 0.052 | 0.080 | 0.044 | 0.027 | 0.042 | 0.001 | 0.024 | 0.039 |
| Teacher-8B | 0.475 | 0.647 | 0.448 | 0.427 | 0.443 | 0.179 | 0.444 | 0.438 |
| PPO Jin et al. (2025) | 0.354 | 0.499 | 0.394 | 0.222 | 0.181 | 0.037 | 0.065 | 0.250 |
| KD Hinton et al. (2015) | 0.406 | 0.508 | 0.405 | 0.369 | 0.355 | 0.119 | 0.266 | 0.347 |
| DGPO (ours) | 0.448 | 0.553 | 0.437 | 0.412 | 0.379 | 0.155 | 0.339 | 0.389 |
Can Compact Language Models Search Like Agents? Distillation-Guided
Policy Optimization for Preserving Agentic RAG Capabilities
Rikuto Kotoge1, Mai Nishimura2, Jiaxin Ma2
1The University of Osaka 2OMRON SINIC X Corporation
[email protected] 2{mai.nishimura, jiaxin.ma}@sinicx.com Work done as a research intern at OMRON SINIC X.
Abstract
Reinforcement Learning has emerged as a dominant post-training approach to elicit agentic RAG behaviors such as search and planning from language models. Despite its success with larger models, applying RL to compact models (e.g., 0.5–1B parameters) presents unique challenges. The compact models exhibit poor initial performance, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which employs cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To understand how compact models preserve agentic behavior, we introduce Agentic RAG Capabilities (ARCap), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.
Project omron-sinicx.github.io/dgpo
Code omron-sinicx/dgpo
Models omron-sinicx/dgpo
1 Introduction
Agentic RAG Singh et al. (2025) has emerged as a new paradigm where LLMs function as autonomous search agents, coordinating retrieval, query reformulation, and evidence integration. While externalizing knowledge storage, these systems require sophisticated reasoning abilities within the LLMs for effective search coordination. Consequently, existing agentic RAG systems predominantly rely on large language models with billions of parameters Xu and Peng (2025), leaving the potential of agentic RAG in resource-constrained environments largely unexplored. The emergence of small language models (SLMs) (Belcak et al., 2025), particularly compact models (e.g., 0.5–1B), raises a compelling question: can we unlock the latent potential of compact language models to acquire the art of agentic RAG?
Eliciting agentic search capabilities from smaller language models typically relies on one of two approaches: reinforcement learning (RL) via self-exploration or knowledge distillation (KD) from a teacher model. We refer to the compact model under training as the student, regardless of the approach. Yet both approaches become largely ineffective for compact models (0.5–1B) due to their poor initial capability. RL Schulman et al. (2017); Shao et al. (2024) suffers from sparse rewards and poor exploration due to weak student-generated outputs (SGOs). Standard KD Hinton et al. (2015); Shing et al. (2025) using only teacher-generated outputs (TGOs) leads to exposure bias Bengio et al. (2015), while on-policy distillation methods Gu et al. (2024); Agarwal et al. (2024) also suffer from the noisy and low-quality nature of SGOs. Neither approach addresses the fundamental bottleneck of poor initial output quality in compact models.
To overcome this fundamental bottleneck, we propose Distillation-Guided Policy Optimization (DGPO), a novel RL framework that addresses the core issue of low-quality SGOs through the strategic integration of teacher guidance and RL (Figure 1). DGPO operates through two key mechanisms. First, cold-start initialization through KD using TGOs dramatically stabilizes early training by providing high-quality initial trajectories. Second, selective teacher guidance during RL rewards correct self-reasoning while providing teacher mimicry for incorrect attempts. This synergy between selective KL-based teacher guidance and RL-driven self-exploration allows the compact model to discover policies that outperform the teacher in some experimental settings.
To understand how DGPO preserves agentic capability in compact models, we introduce Agentic RAG Capabilities (ARCap), a fine-grained evaluation framework that decomposes the agentic search into three core dimensions: thinking, query rewriting, and source referencing (Figure 2). Unlike conventional metrics that focus on final accuracy, ARCap evaluates the agentic search process, revealing how different aspects of agentic behavior emerge and decline across different models. Comprehensive evaluations demonstrate that DGPO consistently outperforms baselines in final accuracy. ARCap reveals that DGPO improves multi-hop reasoning and coordination while maintaining competitive performance in source referencing and query rewriting. Such capability-level insights are crucial for advancing agentic RAG in compact models.
Our contributions are summarized in four key dimensions. (i) Problem: we pioneer the challenging domain of agentic RAG post-training for extremely compact models (0.5–1B), identifying fundamental challenges that existing methods fail to address. (ii) Methodology: we propose Distillation-Guided Policy Optimization (DGPO), an RL framework designed to stabilize training in compact models via cold-start initialization and selective teacher guidance. (iii) Evaluation: we present ARCap, a capability-level evaluation framework that provides a detailed diagnosis of agentic behavior. (iv) Results: DGPO outperforms RL and distillation baselines across multiple model families and sizes. Remarkably, our method achieves teacher-surpassing performance on several datasets.
2 Related Work
Agentic RAG.
WebGPT Nakano et al. (2022) introduced RLHF-driven browser interaction for retrieval-grounded QA. ReAct Yao et al. (2023) generalized this idea by interleaving chain-of-thought and tool calls via special <think> or <act> tokens. To tighten the coupling between retrieval and reasoning, IRCoT Trivedi et al. (2023) explicitly alternates each chain-of-thought (CoT) step with a retrieval. Adaptive-RAG Wang et al. (2025) further predicts retrieval steps based on question complexity. Most recently, Search-R1 Jin et al. (2025) leveraged RL to teach an LLM to generate multi-turn search queries, achieving state-of-the-art results. Our work specifically focuses on enabling agentic RAG in compact models and introduces a comprehensive evaluation framework for multi-dimensional capability evaluation.
Post-training for LLMs.
RL algorithms such as PPO Schulman et al. (2017) and GRPO Shao et al. (2024) have proven effective in enhancing reasoning capabilities for LLMs Comanici et al. (2025); Yang et al. (2025), particularly in domains like mathematical problem solving. At the initial stage of training, base models require sufficient performance to obtain meaningful rewards; otherwise, sparse reward signals lead to training instability. To address this cold-start problem, DeepSeek-R1 Guo et al. (2025) demonstrates that SFT-based model initialization effectively warms up the model prior to RL, achieving favorable results through CoT demonstrations. Our work is the first to integrate distillation principles into both cold-start initialization and concurrent RL training, enabling stable distillation-guided learning in compact models.
Knowledge Distillation for LLMs.
Knowledge distillation (KD) (Hinton et al., 2015) enables smaller student models to learn from larger teacher models by matching softened output distributions. To mitigate the capacity gap between student and teacher models Mirzadeh et al. (2020); Zhang et al. (2023a), some methods use interpolated or smoothed intermediate student and teacher distributions Ko et al. (2024); Shing et al. (2025). However, because these methods rely on TGOs during training while inference still uses the SGOs, a train–inference mismatch arises, leading to exposure bias Bengio et al. (2015). To mitigate this, recent work also proposes on-policy distillation from SGOs Agarwal et al. (2024); Gu et al. (2024); Yang et al. (2025), where the student learns directly from its own generated outputs during training. Another limitation is that distillation methods require sensitive teacher-guidance schedulers Ko et al. (2024); Shing et al. (2025), which are hard to tune and affect stability. Meanwhile, conventional single-hop RAG distillation methods (Kang et al., 2023; Zhang et al., 2023b; Liao et al., 2025) typically use sequence-level knowledge distillation (SeqKD) (Kim and Rush, 2016), an SFT-based approach where students learn from teacher-generated CoT demonstrations rather than matching output distributions. While SeqKD improves the reasoning capability of SLMs (Guo et al., 2025), cross-entropy training with hard targets alone is insufficient to transfer the richer distributional knowledge from the teacher (Hinton et al., 2015). Our method addresses these challenges through a two-stage approach: off-policy KD on TGOs for cold-start initialization, followed by on-policy KD-driven RL optimization on SGOs to further extract model capabilities without requiring handcrafted schedulers.
3 Agentic RAG
3.1 Framework
While agentic RAG Singh et al. (2025) encompasses diverse interpretations, we define it through two key characteristics: (1) autonomous, adaptive search coordination behaviors, and (2) three fundamental actions (thinking, searching, and answering) that constitute agentic intelligence. In agentic RAG, LLMs function as agents making sequential decisions at each timestep. Given a user question and an external retrieval system, the LLM agent operates as a policy, where
[TABLE]
As demonstrated in Table 1, we employ structured tokens Jin et al. (2025) to organize the actions: <think> for reasoning, <search> for database queries, <information> for retrieved documents, and <answer> for final responses. The model can invoke <think> and <search> at arbitrary times and for an arbitrary number of steps, while <answer> is used exactly once at the end to produce the final output.
3.2 Agentic RAG Capability (ARCap)
We propose Agentic RAG Capability (ARCap) as a comprehensive metric to systematically evaluate agentic behavior across multiple dimensions. As demonstrated in Table 1, we characterize ARCap through three core dimensions:
Source Referencing.
Accurately incorporating retrieved information into final answers (shown in the <information> and <answer> entries).
Query Rewriting.
Reformulating user questions into effective search queries, as literal keyword matching often fails to retrieve relevant documents. The agent must paraphrase key concepts and introduce related terms to maximize retrieval effectiveness (illustrated by transforming "Whose album was Red?" into "Red album artist" in <search>).
Thinking.
Making informed decisions about when to retrieve information, which documents contain relevant answers, and how to synthesize multiple pieces of evidence into coherent responses. This involves assessing context sufficiency and integrating retrieved sources in a logically consistent manner (demonstrated in <think> entries).
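The structured tokens from Section 3.1 lend themselves to a simple parser, which is also a convenient starting point for the per-capability analyses in ARCap. The following is a minimal sketch, not the authors' implementation: the helper names `parse_trajectory`, `is_well_formed`, and `count_searches` are hypothetical, and the example string is abbreviated from Table 1.

```python
import re

# Tag names follow the structured tokens described in Section 3.1.
TAG_PATTERN = re.compile(r"<(think|search|information|answer)>(.*?)</\1>", re.DOTALL)


def parse_trajectory(text):
    """Return the (tag, content) steps of a trajectory in order of appearance."""
    return [(tag, body.strip()) for tag, body in TAG_PATTERN.findall(text)]


def is_well_formed(steps):
    """Check the stated constraint: exactly one <answer>, and it comes last."""
    tags = [tag for tag, _ in steps]
    return tags.count("answer") == 1 and tags[-1] == "answer"


def count_searches(steps):
    """Number of <search> calls issued in the trajectory."""
    return sum(1 for tag, _ in steps if tag == "search")


# Abbreviated version of the Table 1 example (hypothetical shortening).
example = (
    "<think> I need to find which artist released the album Red. </think>"
    "<search> Red album artist </search>"
    "<information> Doc 1(Title: \"Red (Taylor Swift album)\") ... </information>"
    "<think> The album Red was released by Taylor Swift. </think>"
    "<answer> Taylor Swift </answer>"
)
steps = parse_trajectory(example)
```

Counting `<search>` steps this way is one plausible route to the "average number of search steps" statistic used later in the Thinking evaluation.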
3.3 Challenges in Compact Models.
Performance Gap.
Our preliminary experiments compared the performance of prompt-based and RL-based agentic RAG models across various model sizes, evaluated on the average of seven QA datasets (Figure 3). Here prompt-based refers to Qwen2.5-instruction checkpoints and RL-based refers to post-trained models using PPO Jin et al. (2025) tailored for agentic RAG. The experimental setup is detailed in Section 5. While RL models boosted performance overall in the context of agentic RAG, smaller models still lagged far behind their larger counterparts. We include this result here to highlight the limitations of applying RL directly to compact models—an observation that motivates our proposed approach, DGPO, introduced in the next section.
Training Instability.
Figure 4 presents the RL training curves of the Qwen2.5-0.5B-instruct model with PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024) for agentic RAG. Smaller models converge faster but tend to become unstable relatively early in training Jin et al. (2025), preventing further performance gains beyond that point. PPO provides more stable optimization than GRPO but converges more slowly.
4 DGPO: Distillation-Guided Policy Optimization
4.1 Core Framework
Figure 5 depicts our framework, which combines distillation and reinforcement learning to train compact agentic RAG models through a two-phase learning strategy, eliminating the need for a handcrafted scheduler. Early-stage student-generated outputs (SGOs) are often noisy and unstable, while teacher-generated outputs (TGOs) provide quality guidance but suffer from exposure bias. To address these challenges, we propose two key mechanisms:
Cold-Start Initialization via KD.
In the initial phase, students learn purely from TGOs via knowledge distillation. This provides stable, high-quality trajectories that dramatically improve early training dynamics and establish a strong foundation for subsequent RL optimization.
Selective KL penalty.
During the RL phase, we apply KL divergence penalties selectively—only to incorrect predictions—guiding students toward informative teacher behaviors while preserving exploration capabilities. This targeted regularization enables autonomous reasoning development without being overly constrained by the teacher model.
4.2 KD initialization with TGOs
During the cold-start phase, we initialize the student model by distilling from a strong teacher policy using a general KD loss that combines cross-entropy from hard labels and KL divergence as:
[TABLE]
Here the student policy is trained against a frozen teacher. We filter TGOs to retain only correct outputs, ensuring the student learns from high-quality teacher samples.
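To make the cold-start objective concrete, the sketch below computes a token-averaged loss that adds a hard-label cross-entropy term to a KL term between teacher and student distributions. It is an illustration under stated assumptions rather than the paper's exact formula: the KL direction (teacher to student), the additive weighting, and the name `kd_loss` are all assumptions, and the default `kl_weight=1.0` merely mirrors the "KL divergence ratio" listed in the hyperparameter table.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, target_ids, kl_weight=1.0):
    """Mean over tokens of CE(hard label) + kl_weight * KL(teacher || student).

    Assumption: forward KL with an additive weight; the source only states
    that cross-entropy and KL divergence are combined.
    """
    total = 0.0
    for s_logits, t_logits, target in zip(student_logits, teacher_logits, target_ids):
        p_student = softmax(s_logits)
        p_teacher = softmax(t_logits)
        ce = -math.log(p_student[target])
        kl = sum(pt * math.log(pt / ps)
                 for pt, ps in zip(p_teacher, p_student) if pt > 0)
        total += ce + kl_weight * kl
    return total / len(target_ids)
```

When student and teacher logits coincide, the KL term vanishes and only the cross-entropy remains, which is a quick sanity check for any implementation of this kind.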
4.3 Distillation-guided RL with SGOs
Upon reaching a performance threshold, we transition to PPO-based RL using the distilled student as the initial policy. This staged approach stabilizes training dynamics and improves sample efficiency, particularly when the student model has significantly fewer parameters than the teacher. By avoiding premature exploration from a weak policy, our method ensures that RL begins with a reasonable approximation of agentic behaviors.
PPO with Search Engine
Proximal Policy Optimization (PPO) Schulman et al. (2017) is a widely used RL algorithm for LLM fine-tuning, offering stable training for compact models. Our method optimizes LLMs with a search engine by maximizing the following objective,
[TABLE]
The objective is defined over the current and previous student policy models. It takes input samples together with the generated outputs, which are interleaved with search engine call results. A clipping-related hyperparameter, introduced in PPO, stabilizes training. The advantage estimate is computed using Generalized Advantage Estimation (GAE) (Schulman et al., 2018), based on future rewards and a learned value function. A token loss masking operation is also applied; see Sect. B.1 for details on token masking.
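The clipped objective with token masking described above can be sketched as follows. This is a simplified, single-trajectory illustration and not the authors' code: the function name `masked_ppo_objective` and the default `clip_eps=0.2` are assumptions, as is the idea that masked positions correspond to retrieved tokens the policy did not generate (the details are deferred to Sect. B.1 in the source).

```python
import math


def masked_ppo_objective(logp_new, logp_old, advantages, loss_mask, clip_eps=0.2):
    """Clipped PPO surrogate averaged over tokens whose mask is 1.

    logp_new / logp_old: per-token log-probabilities under the current and
    previous policy; advantages: per-token advantage estimates (e.g. from GAE).
    """
    total, count = 0.0, 0
    for lp_new, lp_old, adv, keep in zip(logp_new, logp_old, advantages, loss_mask):
        if not keep:
            continue  # assumed: tokens inserted by the retriever are skipped
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the raw and clipped terms.
        total += min(ratio * adv, clipped * adv)
        count += 1
    return total / count if count else 0.0
```

With a positive advantage, a probability ratio above `1 + clip_eps` contributes only the clipped value, so a single over-confident token cannot dominate the update.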
Reward and Selective KL penalty
We employ binary exact matching (EM) for answer rewards to prevent reward hacking:
[TABLE]
The reward compares the predicted answer with the ground truth. However, Eq. 3 provides no learning signal for incorrect predictions, causing training stagnation with poor SGOs. To address this, we introduce a selective KL penalty. The student receives a reward for correct self-reasoning, but when it is incorrect, the teacher guides the student to mimic teacher behavior through KL regularization,
[TABLE]
As illustrated in Figure 5, our approach differs fundamentally from standard PPO-based LLM tuning. While conventional PPO uses a frozen initial LLM as a reference regularizer to prevent excessive drift from the initial policy, DGPO employs the teacher LLM as an active guide that steers the student toward correct behaviors when errors occur. This can be seen as a form of targeted regularization Laroche et al. (2019), which allows free exploration during correct predictions but applies corrective guidance through KL penalties when the student fails. By selectively emphasizing high-divergence incorrect outputs, our method focuses learning on error correction while maintaining autonomous reasoning capabilities, resulting in efficient and stable training.
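One way to picture the selective KL penalty, and its uniform counterpart from the ablation, is as a shaped trajectory-level signal. The sketch below is a hedged illustration only: the source does not specify whether the penalty enters the reward or the loss, how per-token divergences are aggregated, or which KL direction is used. The name `shaped_reward`, the mean aggregation, the simplified string match, and the reuse of the table's "KL divergence coefficient" (0.001) as `kl_coef` are all assumptions.

```python
def exact_match(predicted, gold):
    """Simplified binary EM (case-insensitive); an assumption for illustration."""
    return predicted.strip().lower() == gold.strip().lower()


def shaped_reward(predicted, gold, token_kls, kl_coef=0.001, selective=True):
    """Binary EM reward minus a KL penalty toward the teacher.

    token_kls: per-token student-teacher divergences for the trajectory.
    selective=True  -> penalize only incorrect answers (DGPO-style).
    selective=False -> penalize every trajectory (uniform ablation).
    """
    reward = 1.0 if exact_match(predicted, gold) else 0.0
    mean_kl = sum(token_kls) / len(token_kls) if token_kls else 0.0
    apply_penalty = (reward == 0.0) if selective else True
    return reward - (kl_coef * mean_kl if apply_penalty else 0.0)
```

Under these assumptions a correct trajectory keeps its full reward and is free to diverge from the teacher, while an incorrect one receives a nonzero signal that grows with its divergence.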
5 Experiments
5.1 Experimental setup
We focus our experiments on addressing the following questions:
1. Do our compact models preserve the overall performance of the teacher model?
2. How well do compact models retain individual ARCap components? (a) Source Referencing, (b) Query Rewriting, (c) Thinking.
3. Which components of our method contribute most to performance improvements?
4. Does our method mitigate training instability in compact models?
Datasets.
We evaluate DGPO on seven benchmark datasets, categorized as follows: (1) General Question Answering: NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), and PopQA (Mallen et al., 2023) datasets, which generally require single-hop searching, i.e., the answer can be derived from a single fact or passage. (2) Multi-Hop Question Answering: HotpotQA (Yang et al., 2018), 2WikiMultiHopQA (Ho et al., 2020), MuSiQue (Trivedi et al., 2022), and Bamboogle (Press et al., 2023) datasets, which require multi-hop searching over multiple pieces of evidence across different documents. See Sect. B.4 for details.
Base Models.
As the base student model, we use Qwen2.5-0.5B-instruct Qwen et al. (2025). For the teacher model, we adopt Search-R1-PPO-3B based on Qwen2.5-3B-instruct.
To assess generalizability across different model sizes and families, we also evaluate variants using Qwen2.5-7B-instruct and Llama 3 (Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct-based model) (Grattafiori et al., 2024).
Baselines.
We compare our method against baselines from three categories:
- Reinforcement Learning: Standard PPO Jin et al. (2025), illustrated in Figure 5 (top). We consider two settings: PPO trained from scratch and PPO with a standard SFT warm start. (Footnote 1: We excluded GRPO Shao et al. (2024) as it proved unstable for compact models, collapsing early due to poor SGOs.)
- On‑policy Distillation on SGOs: GKD Agarwal et al. (2024) minimizes reverse KL divergence between teacher and student distributions on SGOs.
- Off‑policy Distillation on TGOs: SeqKD Kim and Rush (2016) applies SFT on teacher outputs; KD Hinton et al. (2015) combines cross-entropy loss with KL divergence; DistiLLM Ko et al. (2024) adopts an adaptive off-policy strategy that integrates both SGOs and TGOs; TAID Shing et al. (2025) employs dynamic scheduling to interpolate from student to teacher distributions. Off-policy methods, except for DistiLLM, train exclusively on correct TGOs. (Footnote 2: We observed that training on only the correct TGOs led to better performance.)
Detailed configurations for baseline and ablation variants can be found in Appendix C.
Evaluation Metrics.
For all evaluations except the search results shown in Table 5, we use Exact Match (EM) as the evaluation metric, following Jin et al. (2025); Yu et al. (2024).
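For readers unfamiliar with the metric, Exact Match can be sketched as below. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow a convention common in open-domain QA evaluation and are an assumption here; the source states only that EM is used. The function names are hypothetical.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))


def mean_exact_match(predictions, references):
    """Average EM over a dataset; references is a list of gold-answer lists."""
    scores = [exact_match(p, golds) for p, golds in zip(predictions, references)]
    return sum(scores) / len(scores)
```

Because the score is binary per question, a partially correct answer such as a surname alone counts as a miss, which is part of why EM is a strict measure for compact models.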
Retrieval Settings.
We follow Jin et al. (2025) and use the 2018 Wikipedia Karpukhin et al. (2020) as the knowledge source and E5 Wang et al. (2024) as the retriever. We set the number of retrieved passages to 3.
Training Settings.
We used the training sets of NQ and HotpotQA datasets. Training was conducted on NVIDIA 8 × H200 GPUs. Implementation details can be found in Appendix B.
5.2 Main Results (1)
Qwen 3B→0.5B.
Table 2 shows the overall performance of different methods across seven QA benchmarks. Our method consistently outperforms all baseline methods on most datasets and achieves the highest average score. Remarkably, our method even surpasses the teacher model on three datasets, as selective teacher guidance stabilizes training while RL enables policy improvements beyond teacher imitation. Among the on-policy methods that rely solely on SGOs, both PPO and GKD exhibit lower performance compared to off-policy distillation methods, due to the difficulty of the multi-turn agentic RAG task and the student’s near-zero initial performance, which makes SGOs highly noisy. This result highlights the limitations of SGOs, which tend to be noisy and less informative than TGOs. Due to the severe capacity constraint of the compact model, SFT-based warm-start combined with RL does not lead to substantial additional improvements (0.275→0.289). In contrast, our KD-based initialization distills the full soft output distribution of the teacher rather than only hard targets. The KD-based initialization alone already leads to stronger performance (0.298) than SFT-based initialization and RL. DistiLLM and TAID perform worse than standard KD. In our setting, where the student model starts with extremely low performance, interpolating between the teacher and student distributions might have created noisy or misleading targets, resulting in weaker learning.
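The headline numbers above can be checked directly from the Table 2 rows. The short script below assumes the Avg. column is an unweighted mean over the seven datasets (which the arithmetic bears out) and lists the datasets on which DGPO exceeds the teacher; the variable names are illustrative.

```python
DATASETS = ["NQ", "TriviaQA", "PopQA", "HotpotQA", "2wiki", "MuSiQue", "Bamboogle"]

# Per-dataset EM scores copied from Table 2 (Qwen 2.5, 3B teacher, 0.5B student).
TABLE2 = {
    "Teacher-3B": [0.365, 0.569, 0.393, 0.340, 0.368, 0.135, 0.298],
    "DGPO": [0.378, 0.481, 0.402, 0.342, 0.303, 0.120, 0.274],
}


def macro_average(scores):
    """Unweighted mean across datasets, rounded as in the table."""
    return round(sum(scores) / len(scores), 3)


teacher_avg = macro_average(TABLE2["Teacher-3B"])
dgpo_avg = macro_average(TABLE2["DGPO"])

# Datasets where the 0.5B student trained with DGPO beats its 3B teacher.
wins = [name for name, student, teacher
        in zip(DATASETS, TABLE2["DGPO"], TABLE2["Teacher-3B"])
        if student > teacher]
```

The three wins are all on single-hop or lightly multi-hop datasets, while the teacher keeps a clear lead on TriviaQA, 2wiki, MuSiQue, and Bamboogle.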
Qwen 7B→0.5B and Llama 8B→1B.
Table 3 shows the average EM scores for models with a larger capacity gap (Qwen2.5 0.5B and 7B) and another model family (Llama3 1B and 8B). DGPO consistently outperforms both PPO and KD across challenging capacity gaps (7–8B→0.5–1B) and different model architectures (Qwen vs. Llama3). While Qwen 3B→0.5B slightly outperforms Qwen 7B→0.5B due to a smaller capacity gap, DGPO effectively exploits compact model potential regardless of the teacher quality. All results can be found in Appendix D.
5.3 ARCap – Source Referencing (2a)
Setup.
To isolate the capability of Source Referencing from other agentic behaviors, we evaluate the model’s accuracy when provided only with the ground-truth supporting contexts (i.e., golden knowledge) as <information> , and forced to answer directly using the <answer> tag. For the MuSiQue dataset, which consists of multi-hop questions requiring multiple supporting documents, we concatenate all relevant ground-truth contexts and supply them as <information> . For the NQ dataset, we use the annotated long answer span as the input <information> . The final answer’s correctness is measured using EM.
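The probing setup described above can be approximated by a prompt that supplies the gold contexts inside `<information>` and then opens the `<answer>` tag so that no thinking or search step can precede it. This is a sketch under assumptions: the exact prompt layout, the choice to force the answer by pre-filling the opening tag, and the function name `build_source_referencing_prompt` are not specified in the source.

```python
def build_source_referencing_prompt(question, gold_contexts):
    """Build a probe that gives only gold contexts and forces a direct answer.

    gold_contexts: list of supporting passages; for multi-hop questions all
    relevant passages are concatenated, as described for MuSiQue.
    """
    information = "\n".join(context.strip() for context in gold_contexts)
    return (
        f"Question: {question}\n"
        f"<information> {information} </information>\n"
        "<answer>"  # assumed: pre-filling the tag forces a direct answer
    )


prompt = build_source_referencing_prompt(
    "Whose album was Red?",
    ["Red is the fourth studio album by Taylor Swift."],
)
```

The model's continuation up to `</answer>` would then be scored with EM, isolating how well it reads an answer out of evidence it did not have to find.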
Results.
Table 5 (w/o thinking column) shows the results for source referencing capability. Among the student models, our model achieves the highest score in extracting information from a single context on the NQ dataset. However, on the MuSiQue dataset, the KD model performs best among the students. One possible explanation is that our RL phase may have over-optimized for simpler, single-step examples, leading to suboptimal performance on complex multi-hop questions.
5.4 ARCap – Query Rewriting (2b)
Setup.
To isolate the Query Rewriting capability from other agentic behaviors, we evaluate whether the initial search query formulated by the model can retrieve documents containing the correct answer, using the NQ dataset. As the evaluation metric, we adopt Hit ratio Ma et al. (2023), which measures whether at least one of the retrieved documents includes the correct answer.
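The Hit ratio described above can be sketched in a few lines. The matching rule used here (case-insensitive substring match of any gold answer in any retrieved document) is an assumption made for illustration, and the function names are hypothetical.

```python
def is_hit(retrieved_docs, gold_answers):
    """True if at least one retrieved document contains a gold answer."""
    docs = [doc.lower() for doc in retrieved_docs]
    return any(answer.lower() in doc for answer in gold_answers for doc in docs)


def hit_ratio(all_retrieved, all_gold):
    """Fraction of questions whose retrieved documents contain a gold answer."""
    hits = sum(is_hit(docs, golds) for docs, golds in zip(all_retrieved, all_gold))
    return hits / len(all_gold)


# Toy example with two questions and three retrieved passages each.
retrieved = [
    ["Red is the fourth studio album by Taylor Swift.", "Red (band) ...", "Red (Dia Frampton album) ..."],
    ["Unrelated passage.", "Another unrelated passage.", "Still unrelated."],
]
gold = [["Taylor Swift"], ["Paris"]]
ratio = hit_ratio(retrieved, gold)
```

Because only the first query is scored in this evaluation, the metric reflects how well the model rewrites the question, independent of its later reasoning.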
Results.
Table 5 (NQ column) shows the results for query rewriting. Interestingly, the PPO model achieves the best performance, even surpassing the teacher model. Our DGPO performs better than KD but reaches a similar hit ratio to the teacher. This may be attributed to our training setup, which mixes both single-hop and multi-hop datasets. Given the limited capacity of the student model, the PPO agent may have focused its exploration on simpler single-hop query writing tasks, rather than the more complex multi-hop reasoning required in other datasets.
5.5 ARCap – Thinking (2c)
Setup.
To evaluate the Thinking capability, we assess how and when the model retrieves and integrates information during the reasoning process. (How:) We provide the ground-truth contexts as <information> and allow the model to perform an additional <think> step immediately after <information> (i.e., the second <think> block in Table 1). Note that such additional thinking was disallowed in the source referencing evaluation (2a). While further retrieval is technically unnecessary, the model is still allowed to perform additional search steps. (When:) We allow multiple retrieval steps and examine whether the model can determine the necessity of additional searches based on intermediate results. In this case, we evaluate both the final Hit ratio and the average number of search steps taken as metrics of efficiency.
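The "average number of search steps" metric can be sketched by counting `<search>` spans in each generated trajectory. This is a minimal illustration that assumes trajectories are available as plain strings using the tag format of Table 1; the function names are hypothetical.

```python
import re

# Matches one complete <search>...</search> span, including multi-line queries.
SEARCH_PATTERN = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def count_search_steps(trajectory: str) -> int:
    """Number of search queries the model issued in one trajectory."""
    return len(SEARCH_PATTERN.findall(trajectory))


def average_search_steps(trajectories: list[str]) -> float:
    """Mean number of search queries per trajectory (0.0 for an empty list)."""
    if not trajectories:
        return 0.0
    return sum(count_search_steps(t) for t in trajectories) / len(trajectories)
```

Reporting this average next to the final Hit ratio shows whether a higher hit ratio was obtained by searching more often.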
Results.
As shown in Table 5 (w/ thinking column), many models, including the teacher, exhibit performance degradation when additional <think> steps are introduced. This suggests that, in our small-model setting, deliberate reasoning through thinking is not crucial for information extraction. Only the RL models improve on the NQ dataset; they may have learned to use thinking to double-check their answers in simpler settings.
As shown in Table 5 (MuSiQue column), while the PPO model performs well in the first retrieval step, our method achieves the highest score for more complex multi-hop reasoning. To achieve higher hit ratios, the distilled model tends to take more search steps. Compared to the teacher, which achieves strong performance with fewer steps due to its larger capacity, our method enables the student to compensate by exploring more extensively.
5.6 Ablation Study (3)
Table 6 presents the results of our ablation study. (a) w/o cold-start initialization by KD, the performance drop is relatively small; however, training becomes unstable and collapses around step 800, so we report the score just before the collapse. (b) w/o selective KL penalty applies KL regularization uniformly across all trajectories, regardless of whether the student’s attempt is correct or incorrect. (c) w/o teacher guidance denotes KD initialization followed by standard PPO without KL regularization during RL. Both variants (b) and (c) degrade performance relative to the full method. (d) Reversing the order (PPO before KD) causes substantial performance loss. These results confirm that all proposed components are essential: KD initialization prevents collapse, and the KD→PPO pipeline with the selective KL penalty is crucial.
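The difference between the selective and the uniform KL penalty in variant (b) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: we assume the selective penalty is applied only to trajectories whose answer is incorrect, that the KL term is estimated from per-token log-probabilities of the student and teacher on agent-generated tokens, and that `beta` is a hypothetical coefficient value.

```python
def sequence_kl_estimate(student_logp, teacher_logp, loss_mask):
    """Sample-based estimate of KL(student || teacher) for one trajectory.

    Sums log p_student - log p_teacher over tokens whose mask is 1
    (agent-generated tokens); retrieved tokens are skipped.
    """
    return sum(m * (s - t) for s, t, m in zip(student_logp, teacher_logp, loss_mask))


def penalized_reward(reward, kl, is_correct, beta=0.001, selective=True):
    """Trajectory-level reward minus an (optionally selective) KL penalty.

    selective=True  : penalize only incorrect attempts (assumed DGPO behavior).
    selective=False : penalize every trajectory (ablation variant (b)).
    """
    apply_penalty = (not is_correct) if selective else True
    return reward - beta * kl if apply_penalty else reward
```

Under this reading, a correct student trajectory keeps its full reward and is free to deviate from the teacher, while an incorrect one is pulled toward the teacher's distribution.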
5.7 Training Dynamics (4)
Figure 6 illustrates the training stability of DGPO and its variants across different RL algorithms and initialization strategies. DGPO maintained a stable training curve beyond 1000 steps, achieving the best overall performance. However, (1) replacing PPO with GRPO leads to an early collapse during RL. Even with KD initialization and teacher guidance, GRPO remains unstable for compact models. (2) When KD initialization is removed from our model, training remains stable for longer than standard PPO but still collapses at around 800 steps. (3) Using GKD, i.e., teacher guidance only, results in stable learning; however, the absence of self-exploration in RL leads to worse performance. (4) When KD-based initialization is further combined with GKD, training collapses due to the excessive constraints imposed by the teacher.
6 Conclusion
We propose Distillation-Guided Policy Optimization (DGPO), a novel RL framework that overcomes the core challenge of poor SGOs in compact models via cold-start initialization and selective teacher guidance. DGPO transforms the reference model from a passive regularizer to an active guidance mechanism, enabling performance improvements rather than merely preventing degradation. Our two-phase approach achieves consistent improvements without complex scheduling. Beyond end-to-end gains, our ARCap-based analysis provides a fine-grained breakdown of how DGPO improves agentic behavior, highlighting its strengths across dimensions such as source referencing, query rewriting, and multi-hop reasoning.
Can compact language models search like agents?
Our findings suggest yes. Starting from a 0.5B model with minimal performance (0.006), DGPO achieves a 55× improvement (0.329), approaching the 3B teacher’s performance (0.353). Remarkably, our student model even surpasses the teacher on several datasets. Given that 0.5B models can run efficiently on CPUs, our method democratizes access to search agents across computing resource-constrained devices like laptops and smartphones, opening possibilities for more practical agentic RAG deployment. As a foundational study on enabling agentic RAG in compact models, we focus on QA tasks for comprehensive evaluation. Future work will extend this approach to diverse tasks requiring agentic reasoning.
Acknowledgements
This work was supported by the “Development Acceleration Use” program of ABCI 3.0, which is provided by AIST and AIST Solutions, and JST AIP Acceleration Research, Japan, Grant Number JPMJCR23U2 and JST PRESTO, Japan, Grant Number JPMJPR2518.
Limitations
Our experiments are restricted to the Qwen2.5 (3B→0.5B, 7B→0.5B) and Llama3 (8B→1B) model families. Given the rapid advancement of LLMs, comprehensive evaluation across all available models is impractical within current research timelines. Due to computational limitations, we restrict our investigation to student models of 0.5–1B parameters and teacher models up to 8B parameters. While larger teacher models are available, this work specifically targets compact models for computing resource-constrained environments, which places exploration of massive teacher models beyond both our computational capacity and research scope.
As stated in Section 5, while our model achieves strong overall performance, optimization across all capability dimensions remains an open challenge. We believe that our ARCap analysis framework and proposed DGPO approach provide essential foundations for enabling compact models to acquire sophisticated agentic behaviors.
Although the use of a teacher model introduces additional overhead, under our setup where a 3B teacher model is used only for inference during RL while a 0.5B student model is fully trained, it increases the overall training time by 9.5%. This overhead is relatively small given the stability and performance improvements achieved.
Our primary motivation is to study distillation for compact models, specifically in the context of agentic RAG. Transferring the approach to other tasks is important, and extending beyond agentic RAG remains valuable future work.
Appendix
Appendix A RL for Agentic RAG
We ground the reinforcement learning framework on the skeletal formalization of Search-R1 Jin et al. (2025), which is one of the state-of-the-art agentic RAG frameworks. We model the agentic search process as a sequential decision-making problem where the LLM agent must learn to coordinate reasoning and retrieval operations. At each step, the agent can either generate text to advance its reasoning or issue queries to the external search engine to gather additional information.
Learning Objective.
The Reinforcement Learning for agentic RAG framework is formulated as:
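$$
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x; \mathcal{R})} \left[ r(x, y) \right] - \beta \, \mathbb{D}_{\mathrm{KL}} \left[ \pi_\theta(y \mid x; \mathcal{R}) \,\|\, \pi_{\mathrm{ref}}(y \mid x; \mathcal{R}) \right]
$$
where $\pi_\theta$ denotes the trainable agent policy that generates action trajectories $y$ conditioned on the input user question $x$ (drawn from the training set $\mathcal{D}$) and an external retrieval system $\mathcal{R}$. The reward function $r$ evaluates the accuracy of generated answers. The KL-divergence term with coefficient $\beta$ provides regularization against the frozen reference policy $\pi_{\mathrm{ref}}$.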
Appendix B Implementation Details
B.1 Token Masking
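Following prior work Jin et al. (2025), we employ token masking during training. In Eq. 2, $\mathbb{I}(y_t)$ is the loss-masking operator, defined as
$$
\mathbb{I}(y_t) =
\begin{cases}
1, & \text{if } y_t \text{ is generated by the LLM agent}, \\
0, & \text{if } y_t \text{ is a retrieved token}.
\end{cases}
$$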
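The masking operator can be sketched in code as follows. This is a minimal illustration, assuming the trajectory is already split into tokens and that the `<information>` and `</information>` delimiters appear as standalone tokens; real tokenizers split these tags differently, and the function names are hypothetical.

```python
def build_loss_mask(tokens: list[str]) -> list[int]:
    """Return 1 for agent-generated tokens and 0 for retrieved content.

    Everything from <information> through </information> (inclusive) is
    treated as retrieved and excluded from the loss.
    """
    mask = []
    inside_information = False
    for token in tokens:
        if token == "<information>":
            inside_information = True
        mask.append(0 if inside_information else 1)
        if token == "</information>":
            inside_information = False
    return mask


def masked_mean_loss(per_token_loss: list[float], mask: list[int]) -> float:
    """Average the per-token loss over unmasked (agent-generated) tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept
```

With this mask, large losses on retrieved passages do not contribute to the gradient, so the update depends only on the tokens the agent itself produced.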
In agentic RAG, the token sequence contains both LLM agent-generated tokens (<search>, <think>, and <answer>) and externally retrieved content from the search system $\mathcal{R}$ (<information>). Computing gradients over retrieved tokens is counterproductive, as it encourages the model to learn how to generate external content rather than focusing on the core agentic capabilities of when and how to search. To prevent this misallocation of model capacity and stabilize training, we apply loss masking to retrieved tokens and documents, ensuring optimization focuses solely on agent-generated content.
B.2 Prompt Template
We used the system template for Qwen2.5 series and the instruction template following Jin et al. (2025). Table 7 shows these templates.
B.3 Training Details
On-policy distillation and RL methods were trained for up to 1000 steps. However, PPO training with a small model is inherently unstable; thus, we report the PPO results at step 200, before training collapses. All models were initialized from the same pretrained checkpoints and trained once. Training took approximately one day on 8×H200 GPUs. The hyperparameters and libraries used for implementation followed those of prior work Jin et al. (2025); Shing et al. (2025). Table 8 shows training parameters.
B.4 Dataset Details
We used seven preprocessed QA datasets following Jin et al. (2025). Table 9 shows dataset statistics. These datasets were originally designed for QA tasks, and our use aligns with their intended purpose.
Appendix C Ablation and Baseline Settings
Table 10 summarizes the ablation and baseline settings used in our study, indicating which components (e.g., KD, PPO loss, GRPO loss, selective or uniform KL penalties) are included in each variant, along with references to the corresponding figures or tables where results are reported.
Appendix D Evaluation across Different Model Families and Larger Capacity Gaps
We evaluate our method across different model families and larger capacity gaps using state-of-the-art compact models (0.5–1B students). Table 11 shows Qwen2.5 (7B→0.5B) results and Table 12 shows Llama 3 (8B→1B) results. Both configurations demonstrate DGPO’s consistent superiority over baseline methods, confirming broad applicability across diverse architectures and capacity gaps.
References
- 1. R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem (2024) On-policy distillation of language models: learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations.
- 2. P. Belcak, G. Heinrich, S. Diao, Y. Fu, X. Dong, S. Muralidharan, Y. C. Lin, and P. Molchanov (2025) Small language models are the future of agentic AI. arXiv:2506.02153.
- 3. S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, Vol. 28.
- 4. G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, L. Marris, S. Petulla, C. Gaffney, A. Aharoni, N. Lintz, T. C. Pais, H. Jacobsson, I. Szpektor, N. Jiang, K. Haridasan, A. Omran, N. Saunshi, D. Bahri, G. Mishra, E. Chu, T. Boyd, B. Hekman, A. Parisi, C. Zhang, K. Kawintiranon, T. Bedrax-Weiss, O. Wang, Y. Xu, O. Purkiss, U. Mendlovic, I. Deutel, N. Nguyen, A. Langley, F. Korn, L. Rossazza, A. Ramé, S. Waghmare, H. Miller, N. B
- 5. A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Roziere, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. McConnell, C. Keller, C. Touret, C. Wu, C. Wong, C. C. Ferrer, C. Nikolaidis, D. Allonsius, D. Song, D. Pintz, D. Livshits, D. Wyatt, D.
- 6. Y. Gu, L. Dong, F. Wei, and M. Huang (2024) MiniLLM: knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations.
- 7. D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645 (8081), pp. 633–638.
- 8. G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
