Interactive Supercomputing on 40,000 Cores for Machine Learning and Data Analysis
Albert Reuther, Jeremy Kepner, Chansup Byun, Siddharth Samsi, William Arcand, David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones, Anna Klein, Lauren Milechin, Julia Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Peter Michaleas

TL;DR
This paper presents the development of an interactive supercomputing system capable of launching thousands of machine learning and data analysis tasks within seconds on a 40,000-core supercomputer, enabling rapid experimentation.
Contribution
It introduces techniques for scaling interactive frameworks like TensorFlow and MATLAB/Octave to tens of thousands of cores with minimal latency.
Findings
32,000 TensorFlow processes launched in 4 seconds
262,000 Octave processes launched in 40 seconds
Enables rapid exploration of machine learning architectures
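To put the two launch figures on a common scale, the short sketch below converts them into average launch rates (processes started per second). The helper name `launch_rate` and the `reported` dictionary are illustrative only and do not come from the paper; the inputs are the figures listed above, and the result is a simple average that says nothing about how the rate varied during a launch.

```python
def launch_rate(processes: int, seconds: float) -> float:
    """Return the average number of processes started per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return processes / seconds


# Figures taken from the findings above: (process count, launch time in seconds).
reported = {
    "TensorFlow": (32_000, 4.0),
    "Octave": (262_000, 40.0),
}

rates = {name: launch_rate(n, t) for name, (n, t) in reported.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} processes/second")
```

By this measure the two launches average 8,000 and 6,550 processes per second respectively, so both sustain several thousand process starts per second.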
Abstract
Interactive massively parallel computations are critical for machine learning and data analysis. These computations are a staple of the MIT Lincoln Laboratory Supercomputing Center (LLSC) and have required the LLSC to develop unique interactive supercomputing capabilities. Scaling interactive machine learning frameworks, such as TensorFlow, and data analysis environments, such as MATLAB/Octave, to tens of thousands of cores presents many technical challenges, in particular rapidly dispatching many tasks through a scheduler, such as Slurm, and starting many instances of applications with thousands of dependencies. Careful tuning of launches and prepositioning of applications overcome these challenges and allow the launching of thousands of tasks in seconds on a 40,000-core supercomputer. Specifically, this work demonstrates launching 32,000 TensorFlow processes in 4 seconds and…
