Beyond Pre-Training: The Full Lifecycle of Foundation Models on HPC Systems
Dino Conciatore, Elia Oggian, Federico Da Forno, Stefano Schuppli, Jerome Tissieres, Joost VandeVondele, Maxime Martinasso

TL;DR
This paper explores the full AI lifecycle on HPC systems, proposing a hybrid cloud-native platform at the Swiss National Supercomputing Centre to enable efficient fine-tuning and inference workflows.
Contribution
It introduces a novel Kubernetes-based architecture combining HPC and cloud resources for complete AI lifecycle management on supercomputers.
Findings
Hybrid platform improves user productivity in AI workflows
Analysis of trade-offs in fine-tuning pipelines and inference services
Blueprint for integrating AI services into supercomputing environments
Abstract
Large-scale pre-training of Foundation Models (FMs) constitutes a computationally intensive first phase for enabling AI across diverse scientific and societal applications. This first phase has positioned High-Performance Computing (HPC) facilities as indispensable backbones of "Sovereign AI" initiatives. While the massive throughput requirements of FM pre-training align with the traditional capability-oriented mission of HPC, subsequent phases of the AI lifecycle, typically referred to as fine-tuning and inference, introduce operational paradigms that can conflict with established batch-processing environments. Moreover, these phases are not computationally trivial: they often require substantial high-end compute resources while exhibiting hardware utilization patterns that differ significantly from those of pre-training. This paper addresses the architectural and strategic…
