Optimizing ML Concurrent Computation and Communication with GPU DMA Engines
Anirudha Agrawal, Shaizeen Aga, Suchita Pati, Mahzabeen Islam

TL;DR
This paper characterizes concurrent computation and communication (C3) in machine learning workloads on GPUs and shows that offloading communication to GPU DMA engines raises C3 from 21% to 72% of ideal speedup.
Contribution
It introduces heuristics for scheduling and resource partitioning, and proposes using GPU DMA engines for communication to enhance C3 performance.
Findings
Without optimization, C3 achieves on average only 21% of ideal speedup.
Schedule prioritization and partitioning of compute units raise C3 to 42% of ideal speedup.
Offloading communication to DMA engines with ConCCL raises C3 to 72% of ideal speedup, with realized speedups of up to 1.67x.
Abstract
Concurrent computation and communication (C3) is a pervasive paradigm in ML and other domains, making its performance optimization crucial. In this paper, we carefully characterize C3 in ML on GPUs, which are most widely deployed for ML training and inference. We observe that while C3 leads to performance uplifts, the uplifts are far lower than ideal speedups (serial computation and communication versus maximum of computation or communication; all times from isolated executions). That is, C3 on average achieves only 21% of ideal speedup. This is due to known challenges of compute and memory interference between concurrent GPU kernels (that is, sharing of the GPU's compute units, caches and HBM). To attain better performance for C3, first, we evaluate dual strategies of schedule prioritization and careful resource partitioning of compute units on GPUs to push performance attained with…
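The ideal-speedup baseline described in the abstract (serial computation plus communication versus the longer of the two, all measured in isolation) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function names and the sample timings are hypothetical, and the "fraction of ideal" formula below (share of the ideal gain over serial execution) is an assumption, since this page does not state how the paper normalizes the percentage.

```python
def ideal_speedup(t_comp: float, t_comm: float) -> float:
    """Serial time divided by the longer of the two isolated times."""
    if t_comp <= 0 or t_comm <= 0:
        raise ValueError("isolated times must be positive")
    return (t_comp + t_comm) / max(t_comp, t_comm)


def realized_speedup(t_comp: float, t_comm: float, t_c3: float) -> float:
    """Serial time divided by the measured concurrent (C3) time."""
    if t_c3 <= 0:
        raise ValueError("concurrent time must be positive")
    return (t_comp + t_comm) / t_c3


def fraction_of_ideal(t_comp: float, t_comm: float, t_c3: float) -> float:
    """Assumed metric: share of the ideal gain over serial that C3 attains."""
    ideal = ideal_speedup(t_comp, t_comm)
    realized = realized_speedup(t_comp, t_comm, t_c3)
    # ideal is always > 1 here because both isolated times are positive.
    return (realized - 1.0) / (ideal - 1.0)


# Hypothetical timings in milliseconds (not taken from the paper).
t_comp, t_comm, t_c3 = 10.0, 10.0, 16.0
print(ideal_speedup(t_comp, t_comm))            # 2.0
print(realized_speedup(t_comp, t_comm, t_c3))   # 1.25
print(fraction_of_ideal(t_comp, t_comm, t_c3))  # 0.25
```

With equal computation and communication times the ideal speedup is 2x, the largest this definition allows; interference between concurrent kernels shows up as a concurrent time well above the longer isolated time, which pulls the realized speedup toward 1x.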
Taxonomy
Topics: Parallel Computing and Optimization Techniques
