Towards a Science of Scaling Agent Systems

Yubin Kim; Ken Gu; Chanwoo Park; Chunjong Park; Samuel Schmidgall; A. Ali Heydari; Yao Yan; Zhihan Zhang; Yuchen Zhuang; Yun Liu; Mark Malhotra; Paul Pu Liang; Hae Won Park; Yuzhe Yang; Xuhai Xu; Yilun Du; Shwetak Patel; Tim Althoff; Daniel McDuff; Xin Liu

arXiv:2512.08296·cs.AI·April 10, 2026

TL;DR

This paper develops a predictive model of how agent system performance varies with coordination, model capability, and measurable system and task factors, based on controlled evaluations of 260 configurations spanning six agentic benchmarks, five architectures, and three LLM families.

Contribution

The paper introduces quantitative scaling principles for agent systems, expressed as a predictive model and validated across five architectures, six benchmarks, and three LLM families, to explain performance dynamics and guide system design.

Findings

01

Performance saturates once coordination increases beyond a certain point.

02

Tool-heavy tasks may experience overhead in multi-agent setups.

03

Architectures without centralized verification propagate errors more.

Abstract

Agents, language model-based systems capable of reasoning, planning, and acting, are widely adopted in real-world tasks, yet how their performance changes as these systems scale across key dimensions remains underexplored. We introduce quantitative scaling principles for agent systems as a predictive model, capturing how performance varies with coordination, model capability, and measurable system and task factors. Across 260 configurations spanning six agentic benchmarks, five canonical architectures (Single-Agent and four Multi-Agent: Independent, Centralized, Decentralized, Hybrid), and three LLM families, we perform controlled evaluations, standardizing tools, prompts, and compute to isolate architectural effects. The resulting model achieves a cross-validated R^2=0.373 across all six benchmarks (R^2=0.413 with a task-grounded capability metric). We identify a robust…
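To make the "cross-validated R^2" figure concrete, here is a minimal sketch of how such a score can be computed for a regression over per-configuration features. It is an illustration under stated assumptions, not the paper's model: the ordinary least-squares fit, the five-fold split, the function name `cross_validated_r2`, and the three synthetic feature columns are all hypothetical stand-ins, and the data are randomly generated.

```python
import numpy as np


def cross_validated_r2(X, y, n_folds=5, seed=0):
    """Pooled out-of-fold R^2 for an ordinary least-squares fit.

    Hypothetical helper for illustration; the paper's actual model form
    and validation protocol are not specified in this abstract.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    preds = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Fit intercept + coefficients on the training folds only.
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        # Predict the held-out fold with the coefficients just fitted.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        preds[test_idx] = A_test @ coef
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data: 260 rows (one per configuration) and three
# made-up feature columns. None of these values come from the paper.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(260, 3))
y_demo = X_demo @ np.array([0.5, -0.3, 0.2]) + 0.1 * rng.normal(size=260)
r2_demo = cross_validated_r2(X_demo, y_demo)
```

Because every prediction is made on data the fit never saw, this score is typically lower than the in-sample R^2 and can even be negative when the features carry no signal, which is why it is a more conservative measure of predictive power.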
