DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

Suchita Pati; Shaizeen Aga; Mahzabeen Islam; Ryan Quach; Saleel Kudchadker; Mohamed Assem Ibrahim

arXiv:2511.06605·cs.DC·April 13, 2026

TL;DR

This paper extends DMA communication offloads on AMD Instinct MI300X GPUs from bandwidth-bound transfers (10s of MB to GB) to latency-bound ML communication (KB to low MB), with gains at both the operator level and the workload level.

Contribution

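It shows how previously untapped features already available in AMD Instinct MI300X GPUs make DMA offloads competitive for latency-bound ML communication, and demonstrates the resulting acceleration at the operator level (collectives such as all-gather and all-to-all) and at the workload level.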

Findings

01

DMA offloads close a performance gap of up to 4.5× in ML collectives.

02

Power savings of 3-10% in ML collectives using DMA offloads.

03

Up to 1.5× lower latency and 1.9× higher throughput in LLM inference.

Abstract

Offloading communication to existing direct memory access (DMA) engines, available on most state-of-the-art commercial GPUs, has emerged as an interesting and low-cost solution to efficiently overlap computation and communication in machine learning (ML). That said, so far, the reach of DMA offloads has been limited to bandwidth-bound scenarios only (10s of MB to GB transfer sizes). In this work, we aim to break this barrier and expand the reach of DMA communication offloads to even latency-bound regions (KB to low MB). Specifically, we discuss in this work hitherto untapped features available in the state-of-the-art AMD Instinct™ MI300X GPUs that render DMA communication offloads competitive even for latency-bound regions. We demonstrate the efficacy of these features at the operator-level (ML communication collectives such as all-gather and all-to-all), and also at the…
