I've Got 99 Problems But FLOPS Ain't One
Alexandru M. Gherghescu, Vlad-Andrei Bădoiu, Alexandru Agache, Mihai-Valentin Dumitru, Iuliu Vasilescu, Radu Mantu, Costin Raiciu

TL;DR
This paper explores the challenges in building and operating large-scale datacenters for machine learning, emphasizing the need for novel networking solutions to support the massive data and communication demands.
Contribution
Based on an analysis of public plans and language model scaling laws, it proposes a research agenda for new wide-area and intra-datacenter networking technologies in large ML-focused datacenters.
Findings
Building such datacenters is feasible with current technology
New networking architectures are needed for efficient inter- and intra-datacenter communication
Research directions include novel transport protocols and datacenter topologies
Abstract
Hyperscalers dominate the landscape of large network deployments, yet they rarely share data or insights about the challenges they face. In light of this supremacy, what problems can we find to solve in this space? We take an unconventional approach to find relevant research directions, starting from public plans to build a $100 billion datacenter for machine learning applications. Leveraging language model scaling laws, we discover what workloads such a datacenter might carry and explore the challenges one may encounter in doing so, with a focus on networking research. We conclude that building the datacenter and training such models is technically possible, but this requires novel wide-area transports for inter-DC communication, a multipath transport and novel datacenter topologies for intra-datacenter communication, high speed scale-up networks and transports, outlining a rich…
