Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig

TL;DR
This paper introduces the agent data protocol (ADP), a unified data format that consolidates diverse agent training datasets, enabling more effective fine-tuning of large language model agents and improving performance across multiple benchmarks.
Contribution
The paper presents ADP, a lightweight, expressive data representation language that unifies heterogeneous datasets for agent training, facilitating scalable and reproducible fine-tuning of LLM agents.
Findings
Unified 13 datasets into ADP format
Achieved ~20% performance improvement on base models
Attained state-of-the-art results on multiple benchmarks
Abstract
Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing…
Peer Reviews
Decision·ICLR 2026 Oral
1. **Standardized and Extensible Data Schema** The paper introduces a well-structured, unified schema that captures essential components of agent-based interactions (tasks, agents, trajectories, and scores), addressing long-standing fragmentation in agent data representation. 2. **Practical Utility Across Diverse Agent Systems** ADP demonstrates strong real-world applicability by enabling seamless data sharing and transformation across different platforms, toolchains, and evaluation p
See questions
The paper is well written and easy to follow. Having a data standard that can help various agent datasets into a single format would help the research community in this area to a great extent and can help with the reusability of the assets with ease. Adopting existing 13 benchmark datasets to the format and open-sourcing them for the community Analysis of the various datasets after the conversion and fine-tuning results to show the power of having a standardized data format and the kind of gene
Can authors comment on the SOTA numbers for various tasks with similarly sized models? I can see improvements for a selected model from its base performance. Are there any fune-tuned models in a similar parameter range that get better numbers than what is reported here with ADP data?
1. The paper presents an excellent motivation, keenly identifying data fragmentation as a key engineering bottleneck in current agent research and providing a clear direction for future data standardization efforts. 2. It provides a valuable contribution to the open-source community by integrating 13 diverse datasets into a unified ADP format and demonstrating the value of mixed data, thereby establishing a solid data foundation for building general agent capabilities. 3. The experiments are c
1. The experiments involve an unfair comparison. While the paper emphasizes the importance of unifying agent fine-tuning data formats through ADP, the comparative experiments do not use an equal amount of ADP and non-unified data. Due to the inconsistency in data scale, it is difficult to convincingly demonstrate the advantages of ADP over other data formats, or to support the claim that mixed data is superior to single-task data. 2. The experiments are incomplete. Although the paper selects Qw
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
