Benchmarking the cost of thread divergence in CUDA

Piotr Bialas; Adam Strzelecki

arXiv:1504.01650·cs.DC·April 8, 2015·1 cites

TL;DR

This paper introduces a micro-benchmark that measures the performance cost of thread divergence in CUDA's SIMT model on different architectures, with a focus on loops, to quantify what incompletely vectorized code costs.

Contribution

It provides a benchmarking tool that quantifies thread divergence costs in CUDA, which had not previously been measured systematically.

Findings

01. Divergence costs vary significantly across architectures.

02. The micro-benchmark effectively measures the impact of divergence on loop performance.

03. The results help optimize CUDA code for better performance.

Abstract

All modern processors include a set of vector instructions. While these give a tremendous boost to performance, they require vectorized code that can take advantage of them. Because ideal vectorization is hard to achieve in practice, one has to decide when different instructions may be applied to different elements of the vector operand. This is especially important in implicit vectorization, as in the NVIDIA CUDA Single Instruction Multiple Threads (SIMT) model, where the vectorization details are hidden from the programmer. To assess the costs incurred by incompletely vectorized code, we have developed a micro-benchmark that measures the characteristics of the CUDA thread divergence model on different architectures, focusing on loop performance.
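
To make the idea of divergent loops concrete, here is a minimal CUDA sketch. It is not the authors' micro-benchmark; the kernel name `branch_loop`, the helper `time_kernel`, the `stride` parameter, and all launch sizes are assumptions chosen for illustration. Each thread runs one of two loops: with `stride = 32` every warp takes a single path, while with `stride = 1` neighbouring lanes of the same warp take different paths, so the warp has to execute both loops.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical, not taken from the paper).
// stride = 1  : paths alternate lane by lane, so every warp diverges.
// stride = 32 : all lanes of a warp share one path (blockDim.x must be a
//               multiple of 32 for this to hold).
__global__ void branch_loop(float *out, int iters, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float x = 1.0f + 1e-6f * tid;
    if ((tid / stride) % 2 == 0) {
        for (int i = 0; i < iters; ++i) x = x * 1.000001f + 0.5f;
    } else {
        for (int i = 0; i < iters; ++i) x = sqrtf(x) + 1.5f;
    }
    out[tid] = x;  // store the result so the loops are not optimized away
}

// Time a single launch in milliseconds using CUDA events.
static float time_kernel(float *d_out, int blocks, int threads,
                         int iters, int stride)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    branch_loop<<<blocks, threads>>>(d_out, iters, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int blocks = 256, threads = 256, iters = 1 << 16;
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, blocks * threads * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    time_kernel(d_out, blocks, threads, iters, 32);  // warm-up launch
    float uniform   = time_kernel(d_out, blocks, threads, iters, 32);
    float divergent = time_kernel(d_out, blocks, threads, iters, 1);
    std::printf("uniform: %.3f ms, divergent: %.3f ms\n", uniform, divergent);
    cudaFree(d_out);
    return 0;
}
```

Built with `nvcc`, the program prints one timing per configuration. The ratio between the two depends on the GPU architecture and on the compiler, which may turn short branches into predicated code, so a single run should be read as an indication rather than a measurement.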

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Advanced Data Storage Technologies