Benchmarking the cost of thread divergence in CUDA
Piotr Bialas, Adam Strzelecki

TL;DR
This paper introduces a micro-benchmark that measures the performance cost of thread divergence in CUDA's SIMT model on different architectures, with a focus on loops, to help assess the cost of incompletely vectorized code.
Contribution
It provides a micro-benchmark that quantifies the cost of thread divergence in CUDA, a cost that had not previously been measured systematically.
Findings
Divergence costs vary significantly across architectures.
The micro-benchmark effectively measures divergence impact on loop performance.
Results help optimize CUDA code for better performance.
Abstract
All modern processors include a set of vector instructions. While this gives a tremendous boost in performance, it requires vectorized code that can take advantage of such instructions. As ideal vectorization is hard to achieve in practice, one has to decide when different instructions may be applied to different elements of the vector operand. This is especially important in implicit vectorization, as in the NVIDIA CUDA Single Instruction Multiple Threads (SIMT) model, where the vectorization details are hidden from the programmer. In order to assess the costs incurred by incompletely vectorized code, we have developed a micro-benchmark that measures the characteristics of the CUDA thread divergence model on different architectures, focusing on loop performance.
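As an illustration of the kind of measurement the abstract describes, the following is a minimal sketch of a loop-divergence timing harness. It is not the authors' benchmark: the names `loop_kernel`, `time_warp_ms`, `kWarpSize`, and `CUDA_CHECK` are hypothetical, and the sketch assumes a single warp of 32 threads timed with CUDA events, which include launch overhead and are therefore only meaningful for large trip counts.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // assumed warp width (hypothetical constant name)

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Each lane runs a dependent multiply-add chain for its own trip count.
// Lanes with different trip counts leave the loop at different times,
// so the warp diverges at the loop exit.
__global__ void loop_kernel(const int *trip_counts, float *out)
{
    int lane = threadIdx.x;
    int n = trip_counts[lane];
    float x = 1.0f + 0.001f * lane;
    for (int i = 0; i < n; ++i)
        x = x * 0.999f + 0.5f;  // each iteration depends on the previous one
    out[lane] = x;              // store the result so the loop stays live
}

// Hypothetical helper: time one warp running loop_kernel, in milliseconds.
// trip_counts must hold exactly one trip count per lane.
float time_warp_ms(const std::vector<int> &trip_counts)
{
    assert(static_cast<int>(trip_counts.size()) == kWarpSize);

    int *d_trips = nullptr;
    float *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_trips, kWarpSize * sizeof(int)));
    CUDA_CHECK(cudaMalloc(&d_out, kWarpSize * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d_trips, trip_counts.data(),
                          kWarpSize * sizeof(int), cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Warm-up launch so one-time initialization is not timed.
    loop_kernel<<<1, kWarpSize>>>(d_trips, d_out);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Timed launch: one block holding a single warp.
    CUDA_CHECK(cudaEventRecord(start));
    loop_kernel<<<1, kWarpSize>>>(d_trips, d_out);
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    CUDA_CHECK(cudaGetLastError());

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d_trips));
    CUDA_CHECK(cudaFree(d_out));
    return ms;
}
```

One way to use the sketch is to call `time_warp_ms` once with all 32 trip counts set to the same large value and once with a single lane set to that value and the rest set to 1. In the usual SIMT picture, where a warp stays in a loop until its last active lane exits, the two timings would be expected to be close, but that is a property to measure on each architecture rather than assume.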
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Advanced Data Storage Technologies
