A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs
Lingqi Zhang, Mohamed Wahib, Haoyu Zhang, Satoshi Matsuoka

TL;DR
This paper analyzes the performance and characteristics of synchronization methods in Nvidia GPUs, providing insights for optimizing single and multi-GPU applications.
Contribution
It offers an in-depth analysis of undocumented features and performance considerations of Nvidia GPU synchronization methods, aiding better design choices.
Findings
Identifies key performance pitfalls of synchronization methods
Provides micro-benchmarks for measuring synchronization performance
Illustrates the practical implications through a case study of the reduction operator
Abstract
GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronization at different levels of granularity within a single GPU. Additionally, the emergence of dense GPU nodes calls for multi-GPU synchronization. Nvidia's latest CUDA provides a variety of synchronization methods. Until now, there has been no full understanding of the characteristics of those synchronization methods. This work explores important undocumented features and provides an in-depth analysis of the performance considerations and pitfalls of the state-of-the-art synchronization methods for Nvidia GPUs. The provided analysis would be useful when making design choices for applications, libraries, and frameworks running on single and/or multi-GPU environments. We provide a case study of the commonly used reduction operator to illustrate how the knowledge gained in our analysis…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Interconnection Networks and Systems
