MegaFlow: Large-Scale Distributed Orchestration System for the Agentic Era
Lei Zhang, Mouxiang Chen, Ruisheng Cao, Jiawei Chen, Fan Zhou, Yiheng Xu, Jiaxi Yang, Zeyao Ma, Liang Chen, Changwei Luo, Kai Zhang, Fan Yan, KaShun Shum, Jiajun Zhang, Zeyu Cui, Feng Hu, Junyang Lin, Binyuan Hui, Min Yang

TL;DR
MegaFlow is a scalable distributed system designed to efficiently orchestrate large-scale agent training and evaluation, addressing infrastructure gaps in the emerging agentic AI era.
Contribution
It introduces a novel architecture that separates services for model, agent, and environment management, enabling flexible scaling and resource allocation for complex agentic tasks.
Findings
Orchestrates tens of thousands of concurrent agent tasks
Maintains high system stability during large-scale operations
Achieves efficient resource utilization in agent training
Abstract
The rapid development of interactive and autonomous AI systems signals our entry into the agentic era. Training and evaluating agents on complex agentic tasks such as software engineering and computer use requires not only efficient model computation but also sophisticated infrastructure capable of coordinating vast agent-environment interactions. However, no open-source infrastructure can effectively support large-scale training and evaluation on such complex agentic tasks. To address this challenge, we present MegaFlow, a large-scale distributed orchestration system that enables efficient scheduling, resource allocation, and fine-grained task management for agent-environment workloads. MegaFlow abstracts agent training infrastructure into three independent services (Model Service, Agent Service, and Environment Service) that interact through unified interfaces, enabling independent…
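The three-service split described in the abstract can be illustrated with a small sketch. This is not MegaFlow's actual API: the class names `StubModel`, `StubEnvironment`, and `AgentService`, the method names, and the `max_concurrency` parameter are all hypothetical, chosen only to show how an agent loop might talk to a model and an environment through narrow interfaces while a semaphore bounds the number of concurrent tasks.

```python
from __future__ import annotations

import asyncio
from typing import Protocol


class ModelService(Protocol):
    """Hypothetical interface: maps an observation to an action."""
    async def generate(self, observation: str) -> str: ...


class EnvironmentService(Protocol):
    """Hypothetical interface: owns per-task environment state."""
    async def reset(self, task_id: str) -> str: ...
    async def step(self, task_id: str, action: str) -> tuple[str, bool]: ...


class StubModel:
    """Stand-in for a real inference backend (assumption for the demo)."""
    async def generate(self, observation: str) -> str:
        return f"action-for:{observation}"


class StubEnvironment:
    """Stand-in environment that finishes every task after max_steps steps."""
    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self._steps: dict[str, int] = {}

    async def reset(self, task_id: str) -> str:
        self._steps[task_id] = 0
        return f"{task_id}:obs0"

    async def step(self, task_id: str, action: str) -> tuple[str, bool]:
        self._steps[task_id] += 1
        n = self._steps[task_id]
        return f"{task_id}:obs{n}", n >= self.max_steps


class AgentService:
    """Runs agent loops; depends only on the two interfaces above."""
    def __init__(self, model: ModelService, env: EnvironmentService,
                 max_concurrency: int) -> None:
        self.model = model
        self.env = env
        self.max_concurrency = max_concurrency

    async def _run_task(self, task_id: str, slots: asyncio.Semaphore) -> int:
        async with slots:  # wait for a free execution slot
            observation = await self.env.reset(task_id)
            done, steps = False, 0
            while not done:
                action = await self.model.generate(observation)
                observation, done = await self.env.step(task_id, action)
                steps += 1
            return steps

    async def run_all(self, task_ids: list[str]) -> dict[str, int]:
        # Create the semaphore inside the running event loop.
        slots = asyncio.Semaphore(self.max_concurrency)
        results = await asyncio.gather(
            *(self._run_task(t, slots) for t in task_ids)
        )
        return dict(zip(task_ids, results))


def demo(num_tasks: int = 100) -> dict[str, int]:
    agent = AgentService(StubModel(), StubEnvironment(max_steps=3),
                         max_concurrency=8)
    return asyncio.run(agent.run_all([f"task-{i}" for i in range(num_tasks)]))
```

Calling `demo(20)` returns a mapping from each task id to the number of steps its loop took. Because `AgentService` sees only the two interfaces, either stub could in principle be replaced by a remote client without touching the agent loop, which is the kind of independent scaling the abstract attributes to the three-service design.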
Peer Reviews
Decision · Submitted to ICLR 2026
1. The three-service modularization yields a clean separation of concerns that simplifies independent scaling and maintenance.
2. The evaluation dataset is substantial (though I could not tell what the dataset actually consists of), providing a degree of empirical credibility rarely seen in infrastructure proposals.
1. Figure 1 contains limited information; I suggest either reducing its space allocation or adding more explanatory details to enhance clarity.
2. In Line 215, what exactly are the “complex resource monitoring and allocation algorithms”? Likewise, what does the “standardized compute instance” implemented by the authors refer to? In Line 246, more details are needed regarding the document database—specifically, the structure of the operational metadata, its storage format, and how it is managed a
- This work addresses a significant challenge regarding scaling up data collection for agentic LLM training.
- The results and analysis are comprehensive from a systems perspective.
- The text is well-written and easy to follow.
- There are no results regarding the downstream utility of MegaFlow in the context of LLM training. I recognize that this is more of an infrastructure/systems paper, but any downstream results would have been appreciated.
- The CPU and memory utilization are consistent but still low.
1. The paper articulates key system-level challenges in scaling interactive agent training, differentiating this setting from traditional large-model training workloads.
2. The three-service architecture is well-structured and clearly explained.
I am not an expert in agent orchestration, so please correct any mistakes in my questions:
1. Comparisons seem to be mainly against high-spec centralized machines rather than alternative distributed or hybrid systems.
2. While the modular three-service abstraction is intuitive, the paper would benefit from a clearer articulation of which components introduce fundamentally new design ideas and which adapt mature cloud-native practices to the agent training context.
3. While the system enables large-
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Software System Performance and Reliability · Scientific Computing and Data Management
