ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training
Yuran Ding, Xinwei Chen, Xiaofan Zhang, Zongwei Zhou

TL;DR
ASAP is a multi-agent system that automates performance optimization for large-scale LLM training. It reduces training step time and increases throughput by combining LLM reasoning, performance profiling, and expert knowledge.
Contribution
The paper introduces a multi-agent framework that automates the diagnosis and optimization of LLM training performance, combining LLM reasoning with profiling insights and expert knowledge.
Findings
Up to 28% reduction in training step time.
Throughput improvement of 1.43× with ASAP.
Throughput rises further, to 2.58×, when human expert input is added.
Abstract
Optimizing large language model (LLM) training on distributed domain-specific accelerator systems presents significant challenges because the optimization space is complex. Existing optimization methods rely on time-consuming manual tuning or resource-intensive black-box searches, which struggle to keep pace with the rapidly evolving LLM domain, leading to slow development and underutilized resources. To address this, we introduce ASAP, an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. It is a multi-agent system, featuring Coordinator, Analyzer, and Proposal agents, that integrates LLM reasoning with insights from performance profiling tools, roofline analysis, and a knowledge base of best practices and successful past optimizations from human experts. Our proposed design can automate the diagnosis of performance bottlenecks and recommend optimized…
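The abstract names roofline analysis as one of the inputs to the agents. As a rough illustration of what such an analysis computes, the sketch below applies the standard roofline model, in which attainable performance is the minimum of peak compute and memory bandwidth times arithmetic intensity. The `Accelerator` class, the `classify` helper, and the hardware numbers are hypothetical and chosen only for illustration; they are not taken from the paper and do not describe ASAP's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical accelerator description (numbers are illustrative only)."""
    peak_flops: float     # peak compute, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) at which the two roofs meet.
        return self.peak_flops / self.mem_bandwidth


def attainable_flops(acc: Accelerator, intensity: float) -> float:
    # Standard roofline bound: min(compute roof, bandwidth roof * intensity).
    return min(acc.peak_flops, acc.mem_bandwidth * intensity)


def classify(acc: Accelerator, flops: float, bytes_moved: float):
    """Return (intensity, attainable FLOP/s, bound type) for one operation."""
    intensity = flops / bytes_moved
    bound = "compute-bound" if intensity >= acc.ridge_point else "memory-bound"
    return intensity, attainable_flops(acc, intensity), bound


# Assumed chip: 100 TFLOP/s peak, 1 TB/s bandwidth, so the ridge is 100 FLOP/byte.
chip = Accelerator(peak_flops=100e12, mem_bandwidth=1e12)

# Two made-up operations: one with low and one with high arithmetic intensity.
low_intensity_op = classify(chip, flops=2e12, bytes_moved=1e11)
high_intensity_op = classify(chip, flops=4e12, bytes_moved=1e10)

if __name__ == "__main__":
    for name, result in [("low", low_intensity_op), ("high", high_intensity_op)]:
        intensity, attainable, bound = result
        print(f"{name}: {intensity:.0f} FLOP/byte, {attainable:.2e} FLOP/s, {bound}")
```

An operation classified as memory-bound would gain more from reducing data movement than from adding compute, which is the kind of signal a bottleneck diagnosis could draw on.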
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification
