OptiNIC: A Resilient and Tail-Optimal RDMA NIC for Distributed ML Workloads

Ertza Warraich; Ali Imran; Annus Zulfiqar; Shay Vargaftik; Sonia Fahmy; Muhammad Shahbaz

arXiv:2512.22743 · cs.DC · December 30, 2025

Open Access

TL;DR

OptiNIC introduces a domain-specific RDMA transport that removes retransmissions and in-order guarantees, reducing latency and improving throughput for distributed ML workloads by leveraging ML's tolerance for data loss.

Contribution

The paper proposes an out-of-order, best-effort RDMA transport tailored to ML. It shifts loss recovery into the ML pipeline and eliminates traditional retransmission mechanisms.
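To make "out-of-order, best-effort" concrete, here is a minimal software sketch of the general idea, not the paper's NIC design. It assumes each packet carries the byte offset of its payload within the message, so the receiver can place data wherever it belongs, in whatever order it arrives, and simply leave gaps for anything that never shows up. The names `Packet` and `place_best_effort` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet: payload plus its byte offset in the message."""
    offset: int
    payload: bytes


def place_best_effort(packets: Iterable[Packet], message_len: int) -> Tuple[bytes, float]:
    """Place packets at their offsets in arrival order.

    Nothing is reordered, buffered, or re-requested. Byte ranges that never
    arrive stay zero-filled. Returns the buffer and the fraction of bytes
    that were actually received.
    """
    buf = bytearray(message_len)    # destination buffer, zero-initialised
    seen = bytearray(message_len)   # 1 for every byte that has arrived
    for pkt in packets:             # iteration order == arrival order
        end = pkt.offset + len(pkt.payload)
        if pkt.offset < 0 or end > message_len:
            continue                # ignore packets that do not fit
        buf[pkt.offset:end] = pkt.payload
        seen[pkt.offset:end] = b"\x01" * len(pkt.payload)
    coverage = sum(seen) / message_len if message_len else 1.0
    return bytes(buf), coverage


# The last chunk arrives first and the middle chunk is lost.
data, coverage = place_best_effort(
    [Packet(8, b"CCCC"), Packet(0, b"AAAA")], message_len=12
)
```

In this sketch the receiver never stalls on the missing middle chunk. It hands the application a partially filled buffer together with a coverage fraction, and deciding what to do about the gap is left to the layer above, which is the role the summary assigns to the ML pipeline.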

Findings

01. 2x faster time-to-accuracy in training
02. 1.6x higher throughput in inference
03. 3.5x lower tail latency

Abstract

As distributed machine learning (ML) workloads scale to thousands of GPUs connected by high-speed interconnects, tail latency in collective communication has become a major bottleneck. Existing RDMA transports, such as RoCE, IRN, SRNIC, and Falcon, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While these approaches work well for general-purpose workloads, they introduce complexity and latency that scale poorly in ML, where even rare packet delays can stall entire model pipelines. We present OptiNIC, a domain-specific RDMA transport that revisits traditional reliability guarantees based on ML's tolerance for partial or missing data. OptiNIC eliminates retransmissions and in-order delivery from the NIC, enabling a best-effort, out-of-order transport model for RDMA. Unlike traditional RDMA, which signals…
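The claim that even rare packet delays can stall entire pipelines follows from simple probability. A collective step finishes only when its slowest flow finishes, so the chance that a step is delayed grows quickly with the number of flows. The sketch below is illustrative arithmetic, not a result from the paper: it assumes flows are delayed independently with the same probability, and the function name `prob_step_delayed` and the example numbers are hypothetical.

```python
def prob_step_delayed(p_flow_delayed: float, num_flows: int) -> float:
    """Probability that at least one of `num_flows` flows is delayed.

    Assumes independent flows with an identical per-flow delay probability:
    P(step delayed) = 1 - (1 - p) ** n.
    """
    if not 0.0 <= p_flow_delayed <= 1.0:
        raise ValueError("p_flow_delayed must be in [0, 1]")
    if num_flows < 0:
        raise ValueError("num_flows must be non-negative")
    return 1.0 - (1.0 - p_flow_delayed) ** num_flows


# With a 0.1% per-flow delay probability, compare small and large collectives.
for n in (8, 64, 1024):
    print(n, round(prob_step_delayed(0.001, n), 3))
```

Under these assumptions, a one-in-a-thousand event per flow affects well over half of all steps once roughly a thousand flows participate, which is why tail latency rather than mean latency governs collective performance at scale.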

Taxonomy

Topics: Cloud Computing and Resource Management · Software-Defined Networks and 5G · Parallel Computing and Optimization Techniques