Monitoring Collective Communication Among GPUs
Muhammet Abdullah Soyturk, Palwisha Akhtar, Erhan Tezcan, Didem Unat

TL;DR
This paper introduces an extension of the ComScribe tool to detect and visualize collective and P2P GPU communications in NVIDIA's NCCL library, aiding performance analysis in HPC and AI workloads.
Contribution
It extends ComScribe, the authors' earlier P2P communication detection tool for GPUs sharing a common host, to cover NCCL collective operations, and it reports data transfer metrics and visualizations to support performance tuning.
Findings
Successfully identified collective and P2P communications in GPU applications.
Generated communication matrices revealing data transfer patterns.
Demonstrated the tool on machine translation and image classification applications.
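To make the idea of a communication matrix concrete, here is a minimal Python sketch that accumulates the bytes moved between GPU pairs. It is not ComScribe's implementation: the event format, the `build_comm_matrix` helper, and the rule that a broadcast is charged as one transfer from the root to every other GPU are all assumptions made for illustration. NCCL's actual ring or tree algorithms may route the data differently.

```python
def build_comm_matrix(num_gpus, events):
    """Accumulate bytes sent from GPU i to GPU j into matrix[i][j].

    `events` is a hypothetical list of dicts; this format is an
    assumption for illustration, not a ComScribe or NCCL data structure.
    """
    matrix = [[0] * num_gpus for _ in range(num_gpus)]
    for event in events:
        size = event["bytes"]
        if event["kind"] == "p2p":
            # A point-to-point transfer touches exactly one (src, dst) cell.
            matrix[event["src"]][event["dst"]] += size
        elif event["kind"] == "broadcast":
            # Simplifying assumption: charge the full payload from the root
            # to every other GPU, ignoring the ring/tree a library may use.
            root = event["root"]
            for dst in range(num_gpus):
                if dst != root:
                    matrix[root][dst] += size
        else:
            raise ValueError(f"unsupported event kind: {event['kind']}")
    return matrix


# Hypothetical trace: two P2P sends from GPU 0 to GPU 1, then a broadcast
# rooted at GPU 2 on a 4-GPU node.
events = [
    {"kind": "p2p", "src": 0, "dst": 1, "bytes": 1024},
    {"kind": "p2p", "src": 0, "dst": 1, "bytes": 1024},
    {"kind": "broadcast", "root": 2, "bytes": 4096},
]
matrix = build_comm_matrix(4, events)
for row in matrix:
    print(row)
```

Each row is a sending GPU and each column a receiving GPU, so a hot cell points to a heavily used pair and a hot row to a GPU that acts as a source for many peers.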
Abstract
Communication among devices in multi-GPU systems plays an important role in terms of performance and scalability. In order to optimize an application, programmers need to know the type and amount of the communication happening among GPUs. Although there are prior works to gather this information in MPI applications on distributed systems and multi-threaded applications on shared memory systems, there is no tool that identifies communication among GPUs. Our prior work, ComScribe, presents a point-to-point (P2P) communication detection tool for GPUs sharing a common host. In this work, we extend ComScribe to identify communication among GPUs for collective and P2P communication primitives in NVIDIA's NCCL library. In addition to P2P communications, collective communications are commonly used in HPC and AI workloads, so it is important to monitor the induced data movement due to…
Taxonomy
Topics: Cloud Computing and Resource Management · Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques
