EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet
Yitao Yuan, Jianglong Nie, Tianyu Bai, Ruizhe Zhou, Siyuan Cao, Xujie Fan, Yuchen Xu, Junkai Chen, Chenqi Zhao, Nengyuan Zhang, Shaoke Fang, Jiangyuan Chen, Yuanfeng Chen, Jiaqi Sun, Zhan Wang, Xiaohua Xu, Yuchao Zhang, Yang Liu, Xiangrui Yang, Jing Lin, Xiaohe Hu, Yang Li

TL;DR
EPIC introduces a standardized Ethernet-based abstraction and polymorphic implementation for in-network collective acceleration, enhancing AI training efficiency and hardware adaptability.
Contribution
The paper presents an Ethernet-compatible abstraction for in-network collectives, together with modular, formally verified polymorphic realizations that support incremental hardware development and broad adoption.
Findings
Validated correctness through formal verification and extensive testing.
Achieved performance improvements in various simulation and hardware environments.
Demonstrated feasibility of the EPIC approach across multiple platforms.
Abstract
In-Network Collective (INC) acceleration holds immense potential for optimizing AI training and inference; however, its cross-layer nature has historically hindered investment and adoption within the open Ethernet ecosystem. To bridge this gap, we propose EPIC (Ethernet Polymorphic In-network Collective), an INC protocol specification and reference system built on the principle of "Unified Abstraction, Polymorphic Realization." EPIC introduces an abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles, while offering polymorphic realizations tailored to varying hardware capabilities. We address three fundamental challenges: first, we employ a modular design that enables an evolutionary path from simple to complex implementations, allowing vendors to iterate their hardware incrementally; second, we apply formal verification methodologies…
