OptiNIC: A Resilient and Tail-Optimal RDMA NIC for Distributed ML Workloads
Ertza Warraich, Ali Imran, Annus Zulfiqar, Shay Vargaftik, Sonia Fahmy, Muhammad Shahbaz

TL;DR
OptiNIC introduces a domain-specific RDMA transport that removes retransmissions and in-order guarantees, reducing latency and improving throughput for distributed ML workloads by leveraging ML's tolerance for data loss.
Contribution
The paper proposes an out-of-order, best-effort RDMA transport tailored to ML. It shifts loss recovery from the NIC into the ML pipeline and eliminates traditional retransmission mechanisms.
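To make the idea of out-of-order, best-effort delivery concrete, here is a minimal software sketch. It is not the OptiNIC hardware design: the function name `reassemble_best_effort`, the `(offset, values)` packet layout, and the zero-fill policy for lost data are all assumptions made for illustration. The paper's actual recovery mechanism is not described in this summary.

```python
def reassemble_best_effort(packets, total_len, fill=0.0):
    """Place each packet at its own offset, in whatever order it arrives.

    packets   : iterable of (offset, values) pairs (hypothetical layout)
    total_len : number of elements in the destination buffer
    fill      : value left in slots whose packet never arrived (assumed policy)

    Returns (buffer, coverage), where coverage is the fraction of slots
    that were written by at least one packet.
    """
    buf = [fill] * total_len
    received = [False] * total_len
    for offset, values in packets:
        # No sequence check and no retransmission request: write and move on.
        end = min(offset + len(values), total_len)
        for i in range(offset, end):
            buf[i] = values[i - offset]
            received[i] = True
    coverage = (sum(received) / total_len) if total_len else 1.0
    return buf, coverage


# A gradient chunk split into four 2-element packets.
gradient = [float(i) for i in range(8)]
packets = [(off, gradient[off:off + 2]) for off in range(0, 8, 2)]

# The packet at offset 4 is lost; the rest arrive out of order.
arrived = [packets[3], packets[0], packets[1]]
buf, coverage = reassemble_best_effort(arrived, total_len=8)
```

The receiver never stalls waiting for the lost packet. It hands the application a partially filled buffer plus a coverage figure, and the ML pipeline decides how to treat the gap.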
Findings
2x faster time-to-accuracy in training
1.6x higher throughput in inference
3.5x lower tail latency
Abstract
As distributed machine learning (ML) workloads scale to thousands of GPUs connected by high-speed interconnects, tail latency in collective communication has become a major bottleneck. Existing RDMA transports, such as RoCE, IRN, SRNIC, and Falcon, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While these approaches work well for general-purpose workloads, they introduce complexity and latency that scale poorly in ML, where even rare packet delays can stall entire model pipelines. We present OptiNIC, a domain-specific RDMA transport that revisits traditional reliability guarantees based on ML's tolerance for partial or missing data. OptiNIC eliminates retransmissions and in-order delivery from the NIC, enabling a best-effort, out-of-order transport model for RDMA. Unlike traditional RDMA, which signals…
Taxonomy
Topics: Cloud Computing and Resource Management · Software-Defined Networks and 5G · Parallel Computing and Optimization Techniques
