TL;DR
This paper presents a benchmark set of autotunable CUDA and OpenCL kernels and demonstrates that dynamic autotuning during runtime can optimize performance across diverse hardware and input conditions, enhancing portability.
Contribution
It introduces a benchmark suite of ten autotunable kernels and a dynamic autotuning approach, integrated into the Kernel Tuning Toolkit, for optimizing performance at application runtime.
Findings
Most kernels reach near-peak performance with autotuning.
Dynamic tuning effectively adapts performance during application runtime.
Rationally designed tuning spaces make autotuning at application runtime feasible.
Abstract
Autotuning of performance-relevant source-code parameters allows applications to be tuned automatically without hard-coding optimizations and thus helps keep performance portable. In this paper, we introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that with autotuning most of the kernels reach near-peak performance on various GPUs and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation also demonstrates that autotuning is key to performance portability. In addition to offline tuning, we also introduce dynamic autotuning of code optimization parameters during application runtime. With dynamic tuning, the Kernel Tuning Toolkit enables applications to re-tune performance-critical kernels at runtime whenever needed, for example, when input data…
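The idea of dynamic autotuning described in the abstract can be illustrated with a minimal Python sketch. It is not the Kernel Tuning Toolkit API: the names `tuning_space`, `DynamicTuner`, `run`, and `reset` are hypothetical, the exhaustive search order is an assumption, and an ordinary Python callable stands in for a CUDA or OpenCL kernel. The sketch only shows the general pattern: every call does useful work, untested configurations are timed one per call, and the fastest configuration is reused once the space is exhausted.

```python
import itertools
import time


def tuning_space(parameters):
    """Yield every configuration as a dict, e.g. {"block": 64, "unroll": 2}."""
    names = list(parameters)
    for values in itertools.product(*(parameters[name] for name in names)):
        yield dict(zip(names, values))


class DynamicTuner:
    """Hypothetical tuner, not the KTT interface.

    Each call to run() either explores one untested configuration or
    reuses the fastest configuration measured so far.
    """

    def __init__(self, kernel, parameters, timer=time.perf_counter):
        self.kernel = kernel          # callable: kernel(data, **config)
        self.parameters = parameters  # dict: parameter name -> list of values
        self.timer = timer            # injectable clock, useful for testing
        self.reset()

    def reset(self):
        """Discard all measurements, e.g. when input characteristics change."""
        self.pending = list(tuning_space(self.parameters))
        self.best_config = None
        self.best_time = float("inf")

    def run(self, data):
        # Explore while untested configurations remain; otherwise exploit.
        config = self.pending.pop() if self.pending else self.best_config
        start = self.timer()
        result = self.kernel(data, **config)
        elapsed = self.timer() - start
        if elapsed < self.best_time:
            self.best_config, self.best_time = config, elapsed
        return result  # the caller always receives a valid result
```

An application would call `run` wherever it previously launched the kernel directly, and call `reset` when it detects that conditions have changed (for instance a different input size), which triggers re-tuning on subsequent calls. A real tuner would also need to account for kernel compilation cost and measurement noise, which this sketch ignores.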
