ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training

Yuran Ding; Xinwei Chen; Xiaofan Zhang; Zongwei Zhou

arXiv:2511.03844·cs.MA·November 7, 2025

Open Access

TL;DR

ASAP is a multi-agent system that automates performance optimization in large-scale LLM training, reducing training time and increasing throughput by integrating reasoning, profiling, and expert knowledge.

Contribution

The paper introduces a multi-agent framework that automates the diagnosis and optimization of LLM training performance by combining LLM reasoning with profiling insights and expert knowledge.

Findings

01

Up to 28% reduction in training step time.

02

Throughput increases to 1.43 times with ASAP.

03

Throughput increases further, to 2.58 times, with human expert input.

Abstract

Optimizing large language model (LLM) training on distributed domain-specific accelerator systems presents significant challenges because of the complex optimization space. Existing optimization methods rely on time-consuming manual tuning or resource-intensive black-box searches, which struggle to keep pace with the rapidly evolving LLM domain, leading to slow development and underutilized resources. To address this, we introduce ASAP, an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. ASAP is a multi-agent system with Coordinator, Analyzer, and Proposal agents that integrates LLM reasoning with insights from performance profiling tools, roofline analysis, and a knowledge base of best practices and successful past optimizations from human experts. Our proposed design can automate the diagnosis of performance bottlenecks and recommend optimized…
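The abstract names roofline analysis as one input to the system. As a rough illustration of that general technique only (not ASAP's implementation, which this page does not describe), the sketch below classifies an operation as memory-bound or compute-bound using the standard roofline bound, attainable FLOP/s = min(peak FLOP/s, arithmetic intensity × memory bandwidth). The `Accelerator` class, the `classify` helper, and all hardware numbers are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical accelerator description (values are illustrative only)."""
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


def attainable_flops(hw: Accelerator, arithmetic_intensity: float) -> float:
    """Standard roofline bound: min(peak compute, intensity * bandwidth)."""
    return min(hw.peak_flops, arithmetic_intensity * hw.mem_bandwidth)


def classify(hw: Accelerator, flops: float, bytes_moved: float):
    """Return (bound type, attainable FLOP/s) for one operation."""
    intensity = flops / bytes_moved            # FLOP per byte moved
    ridge = hw.peak_flops / hw.mem_bandwidth   # intensity where the two roofs meet
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    return bound, attainable_flops(hw, intensity)


# Assumed hardware: 100 TFLOP/s peak, 1 TB/s bandwidth (ridge point: 100 FLOP/byte).
hw = Accelerator(peak_flops=100e12, mem_bandwidth=1e12)

low_intensity_op = classify(hw, flops=2e9, bytes_moved=1e8)    # 20 FLOP/byte
high_intensity_op = classify(hw, flops=4e11, bytes_moved=1e9)  # 400 FLOP/byte
print(low_intensity_op)
print(high_intensity_op)
```

In this sketch, an operation below the ridge point is limited by memory traffic, so raising its arithmetic intensity (for example, by fusing operations or changing how tensors are sharded) is the kind of change a tuning tool would consider first; an operation above it is already limited by peak compute.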

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification