Performance Portability Study of Linear Algebra Kernels in OpenCL
Karl Rupp, Philippe Tillet, Florian Rudolf, Josef Weinbub, Tibor Grasser, and Ansgar Jüngel

TL;DR
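This paper studies how well OpenCL kernels for memory-bandwidth-limited linear algebra operations carry their performance across hardware generations of one vendor and across vendors. It finds that certain combinations of kernel implementation and work size perform well across generations and, to a lesser degree, across vendors.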
Contribution
It demonstrates that optimizing a single kernel is often sufficient to obtain good performance for a large class of more complicated operations, which simplifies performance tuning.
Findings
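Certain combinations of kernel implementation and work size perform well across compute kernels and hardware generations and, to a lesser degree, across vendors.
Optimizing a single kernel is often sufficient for good performance in a large class of more complicated operations.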
Performance portability varies with kernel implementation and hardware.
Abstract
The performance portability of OpenCL kernel implementations for common memory bandwidth limited linear algebra operations across different hardware generations of the same vendor as well as across vendors is studied. Certain combinations of kernel implementations and work sizes are found to exhibit good performance across compute kernels, hardware generations, and, to a lesser degree, vendors. As a consequence, it is demonstrated that the optimization of a single kernel is often sufficient to obtain good performance for a large class of more complicated operations.
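To make the terms in the abstract concrete, here is a minimal Python sketch of two ideas it relies on. It is not taken from the paper. The first function computes the effective memory bandwidth of a streaming operation, the usual figure of merit when an operation is limited by memory bandwidth. The second picks, from a table of bandwidths per configuration and device, the configuration with the best worst-case fraction of each device's peak; this selection rule, the function names, the device labels, and all numbers are assumptions made for illustration, not measurements or methods reported by the authors.

```python
def effective_bandwidth_gbs(n_elements, bytes_per_element, arrays_touched, seconds):
    """Effective memory bandwidth in GB/s for a streaming kernel.

    arrays_touched counts every array read or written once per element,
    e.g. 3 for the vector addition x = y + z (two reads, one write).
    """
    bytes_moved = n_elements * bytes_per_element * arrays_touched
    return bytes_moved / seconds / 1e9


def most_portable_config(bandwidth):
    """Return the configuration with the best worst-case relative performance.

    bandwidth maps configuration -> {device: GB/s}. For each device the best
    value over all configurations is taken as the reference; a configuration
    is scored by its lowest fraction of that reference over all devices.
    This criterion is an assumption of this sketch, not the paper's method.
    """
    devices = {d for per_device in bandwidth.values() for d in per_device}
    best = {d: max(cfg[d] for cfg in bandwidth.values()) for d in devices}

    def worst_case(config):
        return min(bandwidth[config][d] / best[d] for d in devices)

    return max(bandwidth, key=worst_case)


# Vector addition on 125 million doubles that takes half a second.
vector_add_gbs = effective_bandwidth_gbs(125_000_000, 8, 3, 0.5)

# PLACEHOLDER numbers, invented for illustration only.
# Keys are hypothetical (local work size, number of work groups) pairs.
placeholder_bandwidth = {
    (128, 128): {"device_A": 100.0, "device_B": 50.0},
    (256, 64): {"device_A": 90.0, "device_B": 80.0},
    (64, 256): {"device_A": 60.0, "device_B": 75.0},
}
portable = most_portable_config(placeholder_bandwidth)
```

With these placeholder values the configuration that is fastest on one device is not the one selected, because it falls far behind on the other device. That is the trade-off the abstract refers to when it speaks of good performance across hardware generations and vendors.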
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems
