DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication
Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Ryan Quach, Saleel Kudchadker, Mohamed Assem Ibrahim

TL;DR
This paper extends DMA communication offloads on AMD Instinct MI300X GPUs from bandwidth-bound transfers (10s of MB to GB) to latency-bound transfers (KB to low MB), closing a performance gap of up to 4.5× in ML collectives.
Contribution
It discusses previously untapped features of AMD Instinct MI300X GPUs that make DMA offloads competitive for latency-bound ML communication, and demonstrates them at the operator level (collectives such as all-gather and all-to-all) and at the workload level.
Findings
DMA offloads close a performance gap of up to 4.5× in ML collectives.
DMA offloads deliver power savings of 3-10% in ML collectives.
Up to 1.5× lower latency and 1.9× higher throughput in LLM inference.
Abstract
Offloading communication to existing direct memory access (DMA) engines, available on most state-of-the-art commercial GPUs, has emerged as an interesting and low-cost solution to efficiently overlap computation and communication in machine learning (ML). That said, so far, the reach of DMA offloads has been limited to bandwidth-bound scenarios only (10s of MB to GB transfer sizes). In this work, we aim to break this barrier and expand the reach of DMA communication offloads to even latency-bound regions (KB to low MB). Specifically, we discuss in this work hitherto untapped features available in the state-of-the-art AMD Instinct MI300X GPUs that render DMA communication offloads competitive even for latency-bound regions. We demonstrate the efficacy of these features at the operator-level (ML communication collectives such as all-gather and all-to-all), and also at the…
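The distinction between latency-bound and bandwidth-bound transfers can be illustrated with a simple fixed-cost-plus-bandwidth model. The sketch below is not taken from the paper: the function names, the 10 µs per-transfer overhead, and the 50 GB/s link bandwidth are all hypothetical values chosen only to show how the fixed cost dominates small transfers and becomes negligible for large ones.

```python
# Illustrative cost model only; all names and numbers are assumptions,
# not measurements or methods from the paper.

def transfer_time_s(size_bytes: float, overhead_s: float, bandwidth_bps: float) -> float:
    """Time for one transfer: fixed per-transfer overhead plus size / bandwidth."""
    return overhead_s + size_bytes / bandwidth_bps


def overhead_fraction(size_bytes: float, overhead_s: float, bandwidth_bps: float) -> float:
    """Share of the total transfer time spent on the fixed overhead."""
    return overhead_s / transfer_time_s(size_bytes, overhead_s, bandwidth_bps)


def crossover_size_bytes(overhead_s: float, bandwidth_bps: float) -> float:
    """Size at which fixed overhead and data movement take equal time."""
    return overhead_s * bandwidth_bps


# Hypothetical parameters: 10 microseconds of overhead, 50 GB/s link.
OVERHEAD_S = 10e-6
BANDWIDTH_BPS = 50e9

if __name__ == "__main__":
    for label, size in [("4 KiB", 4 * 1024), ("1 MiB", 1024**2), ("1 GiB", 1024**3)]:
        frac = overhead_fraction(size, OVERHEAD_S, BANDWIDTH_BPS)
        print(f"{label}: {frac:.1%} of transfer time is fixed overhead")
```

Under these assumed numbers, transfers well below the crossover size are dominated by the fixed cost, which is why reducing per-transfer overhead matters far more in the KB to low-MB range than raw bandwidth does.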
