Astra: Efficient and Money-saving Automatic Parallel Strategies Search on Heterogeneous GPUs

Peiran Wang; Haibing Li; Fu Haohan; Shiyong Li; Yanpeng Wang; Dou Shen

arXiv:2502.13480·cs.DC·February 20, 2025

TL;DR

Astra is a framework that automatically searches for efficient, cost-effective parallel training strategies on heterogeneous GPUs. It achieves better throughput than expert-designed strategies while keeping the average search time to 1.27 seconds in a single-GPU setting and under 1.35 minutes in a heterogeneous-GPU setting.

Contribution

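Astra searches both the GPU configuration space (GPU types and GPU counts) and the parallel parameter space for the efficiency-optimal strategy, mathematically models the time consumption of heterogeneous training, and is, according to the authors, the first automatic parallel strategy search to optimize for monetary cost.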

Findings

1. Astra achieves better throughput than expert-designed strategies.

2. On average, search time is limited to 1.27 seconds in a single-GPU setting and to less than 1.35 minutes in a heterogeneous-GPU setting.

3. Strategy search results reach an accuracy of over 95%.

Abstract

In this paper, we introduce Astra, an efficient and money-saving automatic parallel strategy search framework for heterogeneous GPUs. First, Astra searches for the efficiency-optimal parallel strategy across both the GPU configuration search space (GPU types and GPU counts) and the parallel parameter search space. Second, Astra supports heterogeneous GPUs by mathematically modeling the time consumption of heterogeneous training. Finally, Astra is the first to propose automatic parallel strategy search that targets money-saving. Experimental results demonstrate that Astra achieves better throughput than expert-designed strategies. On average, Astra's search time is limited to 1.27 seconds in a single-GPU setting and to less than 1.35 minutes in a heterogeneous-GPU setting, with an accuracy of over 95%.
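
To make the two search spaces in the abstract concrete, here is a minimal sketch of a joint search over GPU configurations (type and count) and parallel parameters, scored either by time or by money. It is not Astra's algorithm or cost model: the names `GpuType`, `Strategy`, `step_time`, and `dollars_per_step`, the overhead coefficients, and the GPU throughput and price figures are all hypothetical, and each candidate uses a single GPU type, so the heterogeneous case described in the paper is not modeled.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class GpuType:
    name: str
    tflops: float          # hypothetical sustained throughput per GPU
    price_per_hour: float  # hypothetical rental price per GPU-hour


@dataclass(frozen=True)
class Strategy:
    gpu: GpuType
    tp: int  # tensor-parallel degree
    pp: int  # pipeline-parallel degree
    dp: int  # data-parallel degree

    @property
    def num_gpus(self) -> int:
        return self.tp * self.pp * self.dp


def enumerate_strategies(gpu_types, gpu_counts, degrees=(1, 2, 4, 8)):
    """Yield every (GPU type, tp, pp, dp) whose degrees multiply to an allowed GPU count."""
    allowed = set(gpu_counts)
    for gpu, tp, pp, dp in product(gpu_types, degrees, degrees, degrees):
        if tp * pp * dp in allowed:
            yield Strategy(gpu, tp, pp, dp)


def step_time(s: Strategy, work_tflop: float = 1000.0) -> float:
    """Toy time model (an assumption, not the paper's): ideal compute time
    inflated by fixed penalties for tensor- and pipeline-parallel overhead."""
    compute = work_tflop / (s.gpu.tflops * s.num_gpus)
    overhead = 1.0 + 0.05 * (s.tp - 1) + 0.03 * (s.pp - 1)
    return compute * overhead


def dollars_per_step(s: Strategy) -> float:
    """Money cost of one step: seconds -> hours, times GPUs, times hourly price."""
    return step_time(s) / 3600.0 * s.num_gpus * s.gpu.price_per_hour


def search(gpu_types, gpu_counts, objective):
    """Exhaustively pick the candidate that minimizes the given objective."""
    return min(enumerate_strategies(gpu_types, gpu_counts), key=objective)


# Hypothetical GPU catalog: a fast, expensive type and a slow, cheap one.
gpus = [GpuType("fast", 300.0, 4.0), GpuType("cheap", 100.0, 1.0)]
fastest = search(gpus, [8, 16], step_time)          # efficiency-optimal
cheapest = search(gpus, [8, 16], dollars_per_step)  # money-optimal
```

Under these made-up numbers the two objectives disagree: the time objective picks 16 of the fast GPUs, while the cost objective picks 8 of the cheap ones, which illustrates why a money-saving search is a separate problem from a throughput search. A realistic search would also need memory-feasibility checks and a measured or calibrated time model.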

Taxonomy

Topics: Artificial Intelligence in Games · Reinforcement Learning in Robotics