Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer

Dharma Teja Vooturi; Dhiraj Kalamkar; Dipankar Das; Bharat Kaul

arXiv:2604.00785·cs.LG·April 2, 2026

TL;DR

This paper demonstrates large-scale pretraining of Mixture of Experts language models on the Aurora supercomputer, introducing the Optimus training library and custom GPU kernels, and reporting around 90% scaling efficiency at 12288 GPU tiles.

Contribution

Developed Optimus, an in-house training library supporting standard large-model training techniques, and used it to pretrain MoE models of up to 220B parameters on Aurora at up to 12288 GPU tiles.

Findings

01

Pretrained models of up to 220B parameters on up to 12288 GPU tiles.

02

Achieved around 90% scaling efficiency at 12288 GPU tiles.

03

Speedups of up to 1.71x using custom GPU kernels and novel optimizers.

Abstract

Pretraining Large Language Models (LLMs) from scratch requires a massive amount of compute. The Aurora supercomputer is an exascale machine with 127,488 Intel PVC (Ponte Vecchio) GPU tiles. In this work, we showcase LLM pretraining on Aurora at the scale of thousands of GPU tiles. Towards this effort, we developed Optimus, an in-house training library with support for standard large-model training techniques. Using Optimus, we first pretrained Mula-1B, a 1-billion-parameter dense model, and Mula-7B-A1B, a 7-billion-parameter Mixture of Experts (MoE) model, from scratch on 3072 GPU tiles for the full 4 trillion tokens of the OLMoE-mix-0924 dataset. We then demonstrated model scaling by pretraining three large MoE models, Mula-20B-A2B, Mula-100B-A7B, and Mula-220B-A10B, for up to 100 billion tokens on the same dataset. On our largest model, Mula-220B-A10B, we pushed the compute scaling from 384 to 12288 GPU tiles and observed…
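
To make the scaling figures concrete, here is a minimal Python sketch of how a scaling-efficiency number is commonly computed: measured speedup divided by ideal linear speedup. It assumes the roughly 90% figure is measured against the 384-tile baseline, which the truncated abstract suggests but does not confirm. The function name `scaling_efficiency` and the normalized throughput values are hypothetical and do not come from the paper.

```python
def scaling_efficiency(base_tiles, base_throughput, tiles, throughput):
    """Measured speedup divided by ideal (linear) speedup.

    Throughput can be in any consistent unit (e.g. tokens/second);
    the definition and the 384-tile baseline are assumptions.
    """
    ideal_speedup = tiles / base_tiles
    measured_speedup = throughput / base_throughput
    return measured_speedup / ideal_speedup


# Tile counts taken from the abstract.
BASE_TILES = 384
MAX_TILES = 12288
AURORA_TILES = 127_488

# Going from 384 to 12288 tiles is a 32x increase in compute.
ideal = MAX_TILES / BASE_TILES

# At ~90% efficiency, the implied throughput gain over the baseline
# would be about 0.90 * 32 = 28.8x (under the assumption above).
implied_speedup = 0.90 * ideal

# Share of Aurora's GPU tiles used by the largest run.
machine_fraction = MAX_TILES / AURORA_TILES

# Hypothetical normalized throughputs, chosen only to illustrate the formula.
example = scaling_efficiency(BASE_TILES, 1.0, MAX_TILES, 28.8)

print(f"ideal speedup:    {ideal:.1f}x")
print(f"implied speedup:  {implied_speedup:.1f}x")
print(f"machine fraction: {machine_fraction:.1%}")
print(f"example efficiency: {example:.2f}")
```

Under these assumptions, the largest run uses a little under a tenth of Aurora's GPU tiles, and 90% efficiency corresponds to roughly a 28.8x throughput gain for a 32x increase in tiles.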
