Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing
Logan Ward, Ganesh Sivaraman, J. Gregory Pauloski, Yadu Babuji, Ryan Chard, Naveen Dandu, Paul C. Redfern, Rajeev S. Assary, Kyle Chard, Larry A. Curtiss, Rajeev Thakur, Ian Foster

TL;DR
Colmena is an open-source Python framework that enables scalable, machine-learning-guided steering of ensemble simulations on HPC systems, significantly accelerating scientific discovery processes.
Contribution
It introduces a flexible, scalable framework that integrates ML models with ensemble simulations, simplifying deployment and coordination on high-performance computing resources.
Findings
Scales to 65,536 CPUs on HPC systems.
Accelerates electrolyte discovery by 100x.
Demonstrates effective ML-guided ensemble steering.
Abstract
Scientific applications that involve simulation ensembles can be accelerated greatly by using experiment design methods to select the best simulations to perform. Methods that use machine learning (ML) to create proxy models of simulations show particular promise for guiding ensembles but are challenging to deploy because of the need to coordinate dynamic mixes of simulation and learning tasks. We present Colmena, an open-source Python framework that allows users to steer campaigns by providing just the implementations of individual tasks plus the logic used to choose which tasks to execute when. Colmena handles task dispatch, results collation, ML model invocation, and ML model (re)training, using Parsl to execute tasks on HPC systems. We describe the design of Colmena and illustrate its capabilities by applying it to electrolyte design, where it both scales to 65536 CPUs and…
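The steering pattern the abstract describes (run simulations, retrain a proxy model as results arrive, and use that model to choose the next simulation) can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not Colmena's or Parsl's API: `simulate`, `train_surrogate`, and `steer` are hypothetical names, the "simulation" is a toy function, and the "ML model" is a simple nearest-neighbour lookup standing in for a real learned proxy.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def simulate(x):
    """Hypothetical stand-in for an expensive simulation task."""
    return -(x - 0.7) ** 2


def train_surrogate(results):
    """Hypothetical stand-in for ML (re)training: a 1-nearest-neighbour proxy."""
    def predict(x):
        # Predict the score of the closest candidate simulated so far.
        nearest = min(results, key=lambda known: abs(known - x))
        return results[nearest]
    return predict


def steer(candidates, budget, workers=4):
    """Run at most `budget` simulations, choosing each new one with the proxy."""
    pool = list(candidates)
    results = {}   # candidate -> simulated score
    running = {}   # future -> candidate
    submitted = 0
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Seed the campaign with unguided picks so the proxy has data.
        while pool and submitted < min(workers, budget):
            x = pool.pop(0)
            running[executor.submit(simulate, x)] = x
            submitted += 1
        while running:
            # React as soon as any simulation finishes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
            # Retrain on everything seen so far, then refill idle workers
            # with the candidates the proxy currently ranks highest.
            predict = train_surrogate(results)
            while pool and submitted < budget and len(running) < workers:
                best = max(pool, key=predict)
                pool.remove(best)
                running[executor.submit(simulate, best)] = best
                submitted += 1
    return results


if __name__ == "__main__":
    scores = steer([i / 20 for i in range(21)], budget=8)
    best = max(scores, key=scores.get)
    print(f"best candidate found: {best} (score {scores[best]:.4f})")
```

The loop never waits for a whole batch to finish: each completed simulation immediately triggers retraining and a new selection, which is the kind of dynamic mix of simulation and learning tasks the abstract says is hard to coordinate by hand.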
Taxonomy
Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Parallel Computing and Optimization Techniques
