Trillion Parameter AI Serving Infrastructure for Scientific Discovery: A Survey and Vision
Nathaniel Hudson, J. Gregory Pauloski, Matt Baughman, Alok Kamatar, Mansi Sakarvadia, Logan Ward, Ryan Chard, André Bauer, Maksim Levental, Wenyi Wang, Will Engler, Owen Price Skelly, Ben Blaiszik, Rick Stevens, Kyle Chard, Ian Foster

TL;DR
This paper surveys the development of trillion-parameter AI models, focusing on the infrastructure needed to serve these models for scientific discovery, and discusses technical challenges and future directions.
Contribution
It provides a comprehensive vision and identifies key technical challenges for building AI serving infrastructure tailored to trillion-parameter models in scientific research.
Findings
Identifies critical system design challenges for TPM deployment.
Proposes a software stack to support flexible scientific research needs.
Highlights open problems in serving large-scale AI models.
Abstract
Deep learning methods are transforming research, enabling new techniques, and ultimately leading to new discoveries. As the demand for more capable AI models continues to grow, we are now entering an era of Trillion Parameter Models (TPM), or models with more than a trillion parameters -- such as Huawei's PanGu-Σ. We describe a vision for the ecosystem of TPM users and providers that caters to the specific needs of the scientific community. We then outline the significant technical challenges and open problems in system design for serving TPMs to enable scientific research and discovery. Specifically, we describe the requirements of a comprehensive software stack and interfaces to support the diverse and flexible requirements of researchers.
Taxonomy
Topics: Scientific Computing and Data Management · Big Data and Business Intelligence · IoT and Edge/Fog Computing
