An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience

Jonathan Coles; Stefano Schuppli; Lukas Drescher; Fawzi Roberto Mohamed; Elia Palme; Henrique Mendonça; Miguel Gila; Mark Klein; Maxime Martinasso; Joost VandeVondele; Torsten Hoefler; Thomas Schulthess; Josh Romero; Igor Gorodetsky; Ryan Hankins; Isa Wazirzada; Martin Jaggi; Antoine Bosselut; Imanol Schlag; Antoni-Joan Solergibert i Llaquet; Alejandro Hernández Cano; Theofilos Ioannis Manitaras; Nicholas John Browning

arXiv:2604.12973·cs.DC·April 16, 2026

TL;DR

This paper describes the engineering process of training a large, multilingual AI model on Europe's Alps supercomputer, highlighting infrastructure challenges and solutions for academic-scale LLM training.

Contribution

The paper presents what it describes as the first successful training of a 70B-parameter LLM in academia on HPC infrastructure, and details the technical challenges encountered and how the platform evolved.

Findings

01

Achieved training of a 70B-parameter LLM on the Alps supercomputer.

02

Identified and addressed HPC infrastructure challenges for AI training.

03

Developed a resilient, scalable ML platform for ongoing model fine-tuning.

Abstract

Large Language Models (LLMs) have surged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the associated capital costs and engineering complexity required to train these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. Representing a first-of-its-kind achievement for academia at the 70B parameter scale, we successfully deployed a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We detail the challenges encountered in readying HPC infrastructure for training…
