In-Network Collective Operations: Game Changer or Challenge for AI Workloads?
Torsten Hoefler, Mikhail Khalilov, Josiah Clark, Surendra Anubolu, Mohan Kalkunte, Karen Schramm, Eric Spada, Duncan Roweth, Keith Underwood, Adrian Caulfield, Abdul Kabbani, Amirreza Rastegari

TL;DR
This paper explores in-network collective operations (INC) for AI workloads, covering their potential benefits, challenges, and future prospects, with the aim of bridging the AI and networking communities.
Contribution
It provides a comprehensive overview of INC types, details their advantages and obstacles, and offers predictions for future development in AI and networking integration.
Findings
INC has the potential to significantly accelerate collective operations in AI workloads.
Six key obstacles may hinder INC adoption.
Future INC development depends on overcoming these challenges.
Abstract
This paper summarizes the opportunities of in-network collective operations (INC) for accelerating collective operations in AI workloads. We provide sufficient detail to make this important field accessible to non-experts in AI or networking, fostering a connection between these communities. We consider two types of INC: Edge-INC, where the system is implemented at the node level, and Core-INC, where the system is embedded within network switches. We outline the potential performance benefits as well as six key obstacles in the context of both Edge-INC and Core-INC that may hinder their adoption. Finally, we present a set of predictions for the future development and application of INC.
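For readers new to collectives, the following is a minimal sketch, not taken from the paper, of why moving a reduction into the network can help. It contrasts a host-based ring allreduce (no INC) with an idealized Core-INC style allreduce in which a single switch sums the vectors it receives and multicasts the result. The function names `ring_allreduce` and `switch_allreduce`, the single-switch topology, and the per-host element count used as a cost proxy are all illustrative assumptions. Latency, switch memory limits, and floating-point ordering effects are ignored.

```python
from typing import List, Tuple

Vector = List[float]


def ring_allreduce(inputs: List[Vector]) -> Tuple[List[Vector], int]:
    """Host-based ring allreduce: reduce-scatter followed by allgather.

    Returns the per-host results and the number of elements each host sent.
    """
    p = len(inputs)
    n = len(inputs[0])
    assert n % p == 0, "sketch assumes the vector splits evenly into p chunks"
    size = n // p
    data = [list(v) for v in inputs]
    sent_per_host = 0

    def chunk(c: int) -> slice:
        return slice(c * size, (c + 1) * size)

    # Reduce-scatter: after p-1 steps, host i holds the full sum of chunk (i+1) % p.
    for step in range(p - 1):
        # Snapshot outgoing chunks so all sends in a step happen "simultaneously".
        outgoing = [data[i][chunk((i - step) % p)] for i in range(p)]
        for i in range(p):
            dst = (i + 1) % p
            c = chunk((i - step) % p)
            data[dst][c] = [a + b for a, b in zip(data[dst][c], outgoing[i])]
        sent_per_host += size

    # Allgather: circulate the finished chunks around the ring.
    for step in range(p - 1):
        outgoing = [data[i][chunk((i + 1 - step) % p)] for i in range(p)]
        for i in range(p):
            dst = (i + 1) % p
            data[dst][chunk((i + 1 - step) % p)] = outgoing[i]
        sent_per_host += size

    return data, sent_per_host


def switch_allreduce(inputs: List[Vector]) -> Tuple[List[Vector], int]:
    """Idealized Core-INC sketch (assumption): one switch sums and multicasts.

    Each host sends its vector once; the switch returns the elementwise sum.
    """
    n = len(inputs[0])
    total = [sum(v[k] for v in inputs) for k in range(n)]
    return [list(total) for _ in inputs], n


if __name__ == "__main__":
    # 4 hosts, 8 elements each; small integer-valued floats keep sums exact.
    hosts = [[float(h + k) for k in range(8)] for h in range(4)]
    ring_result, ring_sent = ring_allreduce(hosts)
    inc_result, inc_sent = switch_allreduce(hosts)
    print(ring_result == inc_result)  # True: both compute the same sums
    print(ring_sent, inc_sent)        # 12 8: elements sent per host
```

In this toy model each host sends 2(P-1)N/P elements over the ring but only N elements when the switch performs the reduction, which is one intuition behind the potential performance benefits the abstract mentions. Real deployments face constraints that this sketch does not model.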
Taxonomy
Topics: Software-Defined Networks and 5G · IoT and Edge/Fog Computing · Cloud Computing and Resource Management
