An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience
Jonathan Coles, Stefano Schuppli, Lukas Drescher, Fawzi Roberto Mohamed, Elia Palme, Henrique Mendonça, Miguel Gila, Mark Klein, Maxime Martinasso, Joost VandeVondele, Torsten Hoefler, Thomas Schulthess, Josh Romero, Igor Gorodetsky, Ryan Hankins, Isa Wazirzada, Martin Jaggi

TL;DR
This paper describes the engineering process of training a large, multilingual AI model on Europe's Alps supercomputer, highlighting infrastructure challenges and solutions for academic-scale LLM training.
Contribution
It presents the first successful training of a 70B parameter LLM in academia using HPC infrastructure, detailing technical challenges and platform evolution.
Findings
Trained a 70B-parameter LLM on the Alps supercomputer.
Identified and addressed HPC infrastructure challenges for AI training.
Developed a resilient, scalable ML platform for ongoing model fine-tuning.
Abstract
Large Language Models (LLMs) have surged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the associated capital costs and engineering complexity required to train these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. Representing a first-of-its-kind achievement for academia at the 70B parameter scale, we successfully deployed a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We detail the challenges encountered in readying HPC infrastructure for training…
