Optimizing ML Concurrent Computation and Communication with GPU DMA Engines

Anirudha Agrawal; Shaizeen Aga; Suchita Pati; Mahzabeen Islam

arXiv:2412.14335·cs.AR·April 28, 2025

Open Access

TL;DR

This paper characterizes concurrent computation and communication (C3) on GPUs for machine learning and shows that using GPU DMA engines for communication raises realized speedup from 21% to 72% of the ideal.

Contribution

It introduces heuristics for scheduling and resource partitioning, and proposes using GPU DMA engines for communication to enhance C3 performance.

Findings

01. C3 achieves on average only 21% of ideal speedup without optimization.

02. Scheduling and resource partitioning improve C3 to 42% of ideal speedup.

03. Using DMA engines with ConCCL raises C3 to 72% of ideal speedup and runs up to 1.67x faster.

Abstract

Concurrent computation and communication (C3) is a pervasive paradigm in ML and other domains, making its performance optimization crucial. In this paper, we carefully characterize C3 in ML on GPUs, which are most widely deployed for ML training and inference. We observe that while C3 leads to performance uplifts, the uplifts are far lower than ideal speedups (serial computation and communication versus maximum of computation or communication; all times from isolated executions). That is, C3 on average achieves only 21% of ideal speedup. This is due to known challenges of compute and memory interference between concurrent GPU kernels (that is, sharing of the GPU's compute units, caches, and HBM). To attain better performance for C3, first, we evaluate dual strategies of schedule prioritization and careful resource partitioning of compute units on GPUs to push performance attained with…
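The ideal-speedup definition in the abstract (serial time versus the longer of the two isolated times) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the sample timings are hypothetical, and the way "percent of ideal" is computed here (share of the ideal gain over serial execution) is an assumption, since the excerpt does not state the exact formula the authors use.

```python
def ideal_speedup(t_comp: float, t_comm: float) -> float:
    """Serial time divided by the longer isolated time (definition from the abstract)."""
    return (t_comp + t_comm) / max(t_comp, t_comm)


def realized_speedup(t_comp: float, t_comm: float, t_c3: float) -> float:
    """Serial time divided by the measured time of the concurrent (C3) run."""
    return (t_comp + t_comm) / t_c3


def fraction_of_ideal(t_comp: float, t_comm: float, t_c3: float) -> float:
    """ASSUMPTION: fraction of the ideal gain over serial execution that is realized.

    The excerpt does not define this metric precisely; a plain ratio of
    realized to ideal speedup would be another plausible reading.
    """
    ideal_gain = ideal_speedup(t_comp, t_comm) - 1.0
    if ideal_gain == 0.0:
        # One of the two phases takes no time, so overlap cannot help.
        return 0.0
    return (realized_speedup(t_comp, t_comm, t_c3) - 1.0) / ideal_gain


if __name__ == "__main__":
    # Hypothetical isolated timings in milliseconds, not taken from the paper.
    t_comp, t_comm, t_c3 = 10.0, 10.0, 16.0
    print(ideal_speedup(t_comp, t_comm))
    print(realized_speedup(t_comp, t_comm, t_c3))
    print(fraction_of_ideal(t_comp, t_comm, t_c3))
```

Under this definition the ideal speedup never exceeds 2x, which it reaches only when computation and communication take equal time in isolation.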

Taxonomy

Topics: Parallel Computing and Optimization Techniques