AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes
Jiahao Qiu, Xinzhe Juan, Yimin Wang, Ling Yang, Xuan Qi, Tongcheng Zhang, Jiacheng Guo, Yifu Lu, Zixin Yao, Hongru Wang, Shilong Liu, Xun Jiang, Liu Leqi, and Mengdi Wang

TL;DR
AgentDistill introduces a training-free framework for agent distillation that leverages reusable MCP modules, enabling small language models to generalize and perform well across diverse tasks without extensive training.
Contribution
This work presents a novel, training-free agent distillation method using MCPs, allowing scalable knowledge transfer and improved generalization in small language models.
Findings
Student agents achieve performance comparable to agents built on much larger LLMs.
MCP reuse enables effective generalization across domains.
Framework reduces training costs and complexity.
Abstract
While knowledge distillation has become a mature field for compressing large language models (LLMs) into smaller ones by aligning their outputs or internal representations, the distillation of LLM-based agents, which involve planning, memory, and tool use, remains relatively underexplored. Existing agent distillation methods typically replay full teacher trajectories or imitate step-by-step teacher tool usage, but they often struggle to train student agents to dynamically plan and act in novel environments. We propose AgentDistill, a novel, training-free agent distillation framework that enables efficient and scalable knowledge transfer via direct reuse of Model-Context-Protocols (MCPs), which are structured and reusable task-solving modules autonomously generated by teacher agents. The reuse of these distilled MCPs enables student agents to generalize their capabilities across domains…
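The reuse idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `MCPBox`, `solve_24`, and `student_agent` are hypothetical names chosen for illustration, the tool is written by hand rather than generated by a teacher agent, and the "student" is a stub with a hard-coded tool choice where a small language model would normally decide which tool to call. The Game of 24 task is used only because it appears in the reported results.

```python
from dataclasses import dataclass, field
from fractions import Fraction
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class MCPBox:
    """Hypothetical container for reusable tools distilled from a teacher agent."""
    tools: Dict[str, Callable] = field(default_factory=dict)

    def register(self, name: str, fn: Callable) -> None:
        # Store a tool under a name the student can refer to later.
        self.tools[name] = fn

    def call(self, name: str, *args, **kwargs):
        if name not in self.tools:
            raise KeyError(f"no tool named {name!r} in the box")
        return self.tools[name](*args, **kwargs)


def _search(items: List[Tuple[Fraction, str]], target: Fraction) -> Optional[str]:
    # Combine two values at a time until one value remains, then compare.
    if len(items) == 1:
        return items[0][1] if items[0][0] == target else None
    for i in range(len(items)):
        for j in range(len(items)):
            if i == j:
                continue
            (a, ea), (b, eb) = items[i], items[j]
            rest = [items[k] for k in range(len(items)) if k not in (i, j)]
            candidates = [
                (a + b, f"({ea} + {eb})"),
                (a - b, f"({ea} - {eb})"),
                (a * b, f"({ea} * {eb})"),
            ]
            if b != 0:
                candidates.append((a / b, f"({ea} / {eb})"))
            for value, expr in candidates:
                found = _search(rest + [(value, expr)], target)
                if found is not None:
                    return found
    return None


def solve_24(numbers: List[int], target: int = 24) -> Optional[str]:
    """Illustrative tool: return an expression reaching the target, or None."""
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def student_agent(box: MCPBox, numbers: List[int]) -> Optional[str]:
    # Stand-in for a small LM: the tool choice is hard-coded (an assumption),
    # so the student only needs to delegate, not to search on its own.
    return box.call("solve_24", numbers)


box = MCPBox()
box.register("solve_24", solve_24)
print(student_agent(box, [4, 7, 8, 8]))
```

The point of the sketch is the division of labor: the search logic lives in the box once, and any student that can name the tool can reuse it without being retrained.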
Peer Reviews
Decision · Submitted to ICLR 2026
1. The authors propose an interesting distillation workflow for agents.
2. The paper is well written, with effective visualizations (Figures 2, 3, 5) that clearly illustrate the MCP construction pipeline and case studies.
3. They show clear improvements across datasets, with particularly dramatic gains on Game of 24 (+48.4% for GPT-3.5-turbo) and performance competitive with or superior to stronger baselines (matching the teacher on PathVQA, nearly matching on SLAKE).
1. The motivation for a training-free approach is not clear. What are the specific benefits of this method compared to training-based methods? From my point of view, training-free methods, especially for small language models, require considerable manual prompt engineering to make the output format match expectations. Did the authors use the same prompt across small models? And how was the comparison made fair? By contrast, fine-tuning can make the output format structured as expected. Even if fine-tuning requires more computing reso
- Improving the performance of smaller models is an important research direction. The use of reusable tools (in this case presented as MCPs) makes sense. In addition to being modular, it has the benefit of being more interpretable and easier to apply guardrails to (e.g., safety constraints, controlling access, etc.).
- I like the opportunity for the MCP toolbox to keep improving over time. It could be interesting to further study this in the context of curriculum learning.
- Being training-free makes
- It is unclear to me how general the generated MCP components are. The authors show that they can be reused across tasks within the same dataset, but it is unclear to me how well they would transfer to completely different tasks or domains. For example, would MCP components extracted from one biomedical dataset be useful for another dataset in the same domain?
- Overall this seems like a complex 3-stage pipeline. It would be interesting to see an ablation study that quantifies the con
1. The writing is clear and easy to understand.
2. The experimental results show that this method can improve the performance of the small model.
1. The motivation is quite weird. The reason we want to distill a large model's ability into a small model is that it can **improve the performance of the small model**. As a result, the key is to find a way to improve the performance of the small model. So "How to construct more task-related MCPs based on a given environment" may be a good question, since good MCPs make for good context management, which leads to better performance from the small model. The overemphasis on distillation may mislead
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Metaheuristic Optimization Algorithms Research · Scheduling and Optimization Algorithms
Methods: Knowledge Distillation
