Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

Quentin Anthony; Yury Tokpanov; Skyler Szot; Srivatsan Rajagopal; Praneeth Medepalli; Anna Golubeva; Vasu Shyam; Robert Washbourne; Rishi Iyer; Ansh Chaurasia; Tomas Figliolia; Xiao Yang; Abhinav Sarje; Drew Thorstensen; Amartey Pearson; Zack Grossbart; Jason van Patten; Emad Barsoum; Zhenyu Gu; Yao Fu; Beren Millidge

arXiv:2511.17127 · cs.CL · December 5, 2025

TL;DR

This paper reports the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware (MI300X GPUs with Pollara networking). It provides detailed system benchmarks and model design guidance, and introduces ZAYA1, an MoE model with 760M active parameters (8.3B total) that performs competitively on various benchmarks.

Contribution

The paper contributes a comprehensive cluster and networking characterization, MI300X microbenchmarks on kernel sizing and memory bandwidth, and MI300X-aware transformer sizing rules for attention and MLP blocks. It also introduces the ZAYA1 MoE model, pretrained at scale on AMD hardware.

Findings

01

ZAYA1 achieves performance comparable to leading models.

02

The AMD hardware and software stack is mature enough for large-scale training.

03

Detailed system benchmarks inform model and system design choices.

Abstract

We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs and Pollara networking. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts over Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as…

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques