Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang

TL;DR
This paper introduces Chain-of-Agents, a novel end-to-end LLM reasoning paradigm that simulates multi-agent collaboration within a single model, achieving state-of-the-art results through multi-agent distillation and reinforcement learning.
Contribution
The paper presents a new framework for end-to-end multi-agent problem solving in LLMs, combining multi-agent distillation and agentic RL to create Agent Foundation Models with superior performance.
Findings
Achieves new state-of-the-art results on diverse benchmarks
Demonstrates effective multi-agent collaboration within a single model
Provides open-source models, code, and data for future research
Abstract
Recent advances in large language models (LLMs) and multi-agent systems have demonstrated remarkable capabilities in complex problem-solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi-agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and unable to benefit from data-centric learning. In this work, we introduce Chain-of-Agents (CoA), a novel paradigm of LLM reasoning that enables native end-to-end complex problem-solving in the same way as a multi-agent system (i.e., multi-turn problem solving with multiple tools and multiple agents) within one model. In chain-of-agents problem-solving, the model dynamically activates different tool agents and role-playing agents to simulate multi-agent collaboration in an end-to-end fashion. To…
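The activation loop described in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the `<agent name="...">` tag format, the `run_chain` function, and the scripted stand-in for the model are all assumptions made for this example, since the summary does not specify how agent activations are encoded.

```python
import re
from typing import Callable, Dict, List

# Hypothetical activation format: <agent name="...">payload</agent>.
# The paper's actual trajectory format is not given in this summary.
AGENT_CALL = re.compile(r'<agent name="(\w+)">(.*?)</agent>', re.DOTALL)


def run_chain(
    model_step: Callable[[str], str],
    agents: Dict[str, Callable[[str], str]],
    question: str,
    max_turns: int = 8,
) -> List[str]:
    """Run one model in a loop, dispatching any agent it activates."""
    trajectory = [f"Question: {question}"]
    for _ in range(max_turns):
        # The single model sees the whole trajectory so far.
        output = model_step("\n".join(trajectory))
        trajectory.append(output)
        match = AGENT_CALL.search(output)
        if match is None:
            break  # no agent activated: treat the output as the final answer
        name, payload = match.group(1), match.group(2)
        handler = agents.get(name)
        observation = handler(payload) if handler else f"unknown agent: {name}"
        trajectory.append(f"Observation[{name}]: {observation}")
    return trajectory


def scripted_model(context: str) -> str:
    """Stand-in for the LLM: searches once, then answers."""
    if "Observation[search]" not in context:
        return '<agent name="search">capital of France</agent>'
    return "Final answer: Paris"


# A toy tool agent; a real one would call a search backend.
agents = {"search": lambda query: "Paris is the capital of France."}
trajectory = run_chain(scripted_model, agents, "What is the capital of France?")
for step in trajectory:
    print(step)
```

The point of the sketch is that only one model produces every turn. The "agents" are either roles the model plays in its own output or tools invoked by the surrounding loop, which is what distinguishes this setup from a framework that orchestrates several separately prompted models.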
Peer Reviews
Decision·Submitted to ICLR 2026
- The empirical results of this work are strong, with state-of-the-art performance on a variety of benchmarks across applications, including deep search, math, and MHQA tasks.
- The token reduction compared to multi-agent systems is significant, reducing inference cost (as measured in tokens) by an impressive 84%.
- The ablations suggest that both the SFT and RL components of the training pipeline contribute meaningfully, which is a useful ablation for understanding which component
- The framing around AFMs seems to be undermined, to some degree, by the fact that the proposed AFMs are domain-specific, requiring specialized AFMs (specialization is at odds with the colloquial meaning of the term "foundation model"). Section 5.3 described one example in which cross-domain generalization was observed, but this does not seem like strong enough evidence to claim that the proposed models are general-purpose agentic foundation models.
- I may have missed something, but it remains
1. The main strength of this paper lies in its experimental section. The results are compelling, with strong evaluations against numerous baselines and models at different scales, including 32B ones. These elements collectively support the paper’s claims effectively.
2. The analysis in Section 5 highlights important contributing factors. It is particularly interesting to observe that test-time computation significantly benefits the trained network, although further clarifications would be helpfu
1. The novelty of the work does not appear to be particularly strong; it seems more like a well-executed engineering contribution. My main concern arises from the comparison with Search-R1. The proposed framework resembles an N-agent extension of Search-R1, where multiple agents with different tools replace the single agent equipped with a search tool. While effective in practice, this looks quite incremental to me.
2. Another limitation is that distilling all capabilities into a single LL
- The paper presents a well-structured design combining multi-role reasoning and domain-specific tool agents under a single model.
- Extensive benchmarks across search, reasoning, math, and coding show consistent improvements.
- The paper provides substantial implementation details and appendices explaining the tool setup and data processing.
- The main technical contribution is the Chain-of-Agents architecture. The data synthesis pipeline and training (SFT + RL on distilled trajectories) mostly follow well-established practices.
- The paper demonstrates performance gains but offers limited insight into why it outperforms explicit multi-agent systems or where the improvements come from, and into why the gains are uneven: significant on some datasets but marginal on others. Or
