CosmoFlow: Using Deep Learning to Learn the Universe at Scale
Amrita Mathuriya, Deborah Bard, Peter Mendygral, Lawrence Meadows, James Arnemann, Lei Shao, Siyu He, Tuomas Karna, Daina Moise, Simon J. Pennycook, Kristyn Maschoff, Jason Sewall, Nalini Kumar, Shirley Ho, Mike Ringenburg, Prabhat, Victor Lee

TL;DR
CosmoFlow leverages scalable deep learning on supercomputers to analyze large-scale cosmological data, achieving high efficiency and accuracy in predicting universe parameters.
Contribution
This work introduces a highly scalable TensorFlow-based deep learning application for cosmology, optimized for supercomputing environments and demonstrating unprecedented performance.
Findings
Achieved 3.5 Pflop/s sustained performance on 8192 nodes of Cori
Demonstrated fully synchronous data-parallel training with 77% parallel efficiency
Predicted cosmological parameters with unprecedented accuracy
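As a rough check of how the first two figures relate, the sketch below assumes the usual definition of parallel efficiency (speedup over one node divided by the node count) and divides the reported aggregate rate evenly across nodes. The function names are illustrative, and the outputs are back-of-the-envelope values derived from the numbers above, not results reported by the paper.

```python
def implied_speedup(efficiency: float, nodes: int) -> float:
    """Speedup over a single node, ASSUMING efficiency = speedup / nodes."""
    return efficiency * nodes


def per_node_gflops(total_pflops: float, nodes: int) -> float:
    """Average sustained rate per node in Gflop/s (1 Pflop/s = 1e6 Gflop/s)."""
    return total_pflops * 1e6 / nodes


if __name__ == "__main__":
    # 77% efficiency on 8192 nodes implies roughly a 6308x speedup.
    print(round(implied_speedup(0.77, 8192)))    # 6308
    # 3.5 Pflop/s over 8192 nodes averages about 427 Gflop/s per node.
    print(round(per_node_gflops(3.5, 8192), 1))  # 427.2
```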
Abstract
Deep learning is a promising tool to determine the physical model that describes our universe. To handle the considerable computational cost of this problem, we present CosmoFlow: a highly scalable deep learning application built on top of the TensorFlow framework. CosmoFlow uses efficient implementations of 3D convolution and pooling primitives, together with improvements in threading for many element-wise operations, to improve training performance on Intel(R) Xeon Phi(TM) processors. We also utilize the Cray PE Machine Learning Plugin for efficient scaling to multiple nodes. We demonstrate fully synchronous data-parallel training on 8192 nodes of Cori with 77% parallel efficiency, achieving 3.5 Pflop/s sustained performance. To our knowledge, this is the first large-scale science application of the TensorFlow framework at supercomputer scale with fully synchronous training. These…
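To make "fully synchronous data-parallel training" concrete, here is a minimal toy sketch in plain Python. It does not use TensorFlow or the Cray PE Machine Learning Plugin, and it is not CosmoFlow's implementation: the one-parameter model, the data, and the `allreduce_mean` helper are all hypothetical stand-ins. Each simulated worker computes a gradient on its own shard, the gradients are averaged, and every replica applies the same update.

```python
def shard_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for a collective allreduce: every worker receives the mean."""
    return sum(values) / len(values)


def synchronous_step(w, shards, lr):
    """One fully synchronous data-parallel SGD step."""
    # Each worker computes a gradient on its own shard of the batch.
    local_grads = [shard_gradient(w, shard) for shard in shards]
    # All workers wait for the averaged gradient before updating,
    # so every replica holds identical weights after the step.
    return w - lr * allreduce_mean(local_grads)


if __name__ == "__main__":
    # Toy data generated by y = 3 * x, split evenly across 4 simulated workers.
    data = [(float(x), 3.0 * x) for x in range(1, 9)]
    shards = [data[i::4] for i in range(4)]
    w = 0.0
    for _ in range(200):
        w = synchronous_step(w, shards, lr=0.01)
    print(round(w, 3))
```

With equally sized shards, the averaged gradient equals the full-batch gradient, so the run converges to the same weight a single worker would reach on the whole batch. The cost of this scheme is that every step waits on the slowest worker and on the communication of gradients, which is why parallel efficiency is the figure of merit at scale.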
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems
Methods: 3D Convolution · Convolution
