Communication Offloading on SmartNIC DPUs: A Quantitative Approach
Jacob Wahlgren, Andong Hu, Roger Pearce, Maya Gokhale, Ivy Peng

TL;DR
This paper evaluates the feasibility and performance of offloading communication tasks to SmartNIC DPUs, demonstrating speedups of up to 1.55x for host-dominated workloads and identifying key bottlenecks such as increased DRAM traffic.
Contribution
It introduces Buddy, a communication offloading engine for SmartNIC DPUs, and provides a quantitative analysis of its performance and challenges.
Findings
Up to 1.55x speedup in host-dominated workloads with offloading.
Memory-to-communication ratio predicts offloading performance.
625x increase in DRAM traffic due to lack of Direct Cache Access.
Abstract
SmartNIC Data Processing Units (DPUs) offer a promising solution for saving high-end CPU resources by offloading tasks to programmable cores near the network interface. In this work, we explore the feasibility of SmartNIC DPUs in supporting an asynchronous communication model called "fire-and-forget", particularly its core message routing service. We design a communication offloading engine called Buddy that decouples communication tasks from the application process. Buddy runs flexibly on SmartNIC DPUs, such as the Nvidia BlueField-3 DPU, and on generic x86 CPUs. Our evaluation of five applications identifies the memory-to-communication ratio as a key predictor of offloading performance. Host-dominated workloads, such as Quicksilver and Sparse Matrix Transpose, achieved up to 1.55x speedup with communication offloaded to the DPU. We further identify a 625x increase in DRAM traffic…
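To make the "fire-and-forget" idea in the abstract concrete, the following is a minimal sketch in Python of a sender that hands messages to a separate routing worker and continues without waiting for a reply. It is an illustration only and not Buddy's actual design or API: the class `OffloadEngine`, its methods, and the in-process thread standing in for the DPU are all hypothetical names and assumptions introduced here.

```python
import queue
import threading


class OffloadEngine:
    """Hypothetical stand-in for a communication engine (not Buddy's API)."""

    def __init__(self, num_ranks):
        self._outbox = queue.Queue()
        # Per-destination mailboxes; a real system would send over the network.
        self.delivered = {rank: [] for rank in range(num_ranks)}
        # The worker thread plays the role of the offloaded routing service.
        self._worker = threading.Thread(target=self._route, daemon=True)
        self._worker.start()

    def send(self, dest, payload):
        # Fire-and-forget: enqueue and return at once, no reply is awaited.
        self._outbox.put((dest, payload))

    def _route(self):
        # Drain the outbox and route each message to its destination.
        while True:
            item = self._outbox.get()
            if item is None:  # sentinel: stop routing
                break
            dest, payload = item
            self.delivered[dest].append(payload)

    def shutdown(self):
        # Flush remaining messages, then stop the worker.
        self._outbox.put(None)
        self._worker.join()


def run_demo():
    engine = OffloadEngine(num_ranks=2)
    for i in range(4):
        engine.send(i % 2, f"msg-{i}")  # the application never blocks on delivery
    engine.shutdown()
    return engine.delivered
```

In this sketch the application thread only pays the cost of an enqueue, while routing happens elsewhere; that separation is the property the abstract describes as decoupling communication tasks from the application process.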
