A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit

Filip Petrovič; David Střelák; Jana Hozzová; Jaroslav Oľha; Richard Trembecký; Siegfried Benkner; Jiří Filipovič

arXiv:1910.08498·cs.DC·March 2, 2020

TL;DR

This paper presents a benchmark set of autotunable CUDA and OpenCL kernels and demonstrates that dynamic autotuning during runtime can optimize performance across diverse hardware and input conditions, enhancing portability.

Contribution

The paper introduces a benchmark set of ten autotunable CUDA and OpenCL kernels and adds dynamic autotuning to the Kernel Tuning Toolkit, so that performance-critical kernels can be re-tuned while the application is running.

Findings

01

With autotuning, most kernels reach near-peak performance on various GPUs and outperform baseline implementations on CPUs and Xeon Phis.

02

Dynamic tuning lets applications re-tune performance-critical kernels at runtime whenever needed.

03

Rationally designed tuning spaces make autotuning during application runtime feasible.

Abstract

Autotuning of performance-relevant source-code parameters allows applications to be tuned automatically without hard-coding optimizations and thus helps keep performance portable. In this paper, we introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that with autotuning most of the kernels reach near-peak performance on various GPUs and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation also demonstrates that autotuning is key to performance portability. In addition to offline tuning, we also introduce dynamic autotuning of code optimization parameters during application runtime. With dynamic tuning, the Kernel Tuning Toolkit enables applications to re-tune performance-critical kernels at runtime whenever needed, for example, when input data…
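The idea of dynamic autotuning described in the abstract can be illustrated with a minimal sketch. It is not the Kernel Tuning Toolkit API: the `DynamicTuner` class, the `blocked_sum` stand-in kernel, and the parameter names `block_size` and `use_builtin` are all hypothetical, and plain Python timing replaces GPU kernel measurement. The sketch assumes an exhaustive search in which each regular kernel invocation also serves as one tuning step, so the application keeps producing valid results while the tuning space is explored.

```python
import time
from itertools import product


class DynamicTuner:
    """Hypothetical helper (not the KTT API): tries one untested
    configuration per call, then reuses the fastest one found."""

    def __init__(self, kernel, tuning_space):
        self.kernel = kernel
        names = list(tuning_space)
        # Tuning space = Cartesian product of all parameter values.
        self.configs = [dict(zip(names, values))
                        for values in product(*tuning_space.values())]
        self.reset()

    def reset(self):
        """Discard measurements, e.g. when input characteristics change."""
        self.pending = list(self.configs)
        self.timings = []  # list of (elapsed_seconds, config)

    def best(self):
        """Return the fastest configuration measured so far, or None."""
        if not self.timings:
            return None
        return min(self.timings, key=lambda t: t[0])[1]

    def run(self, data):
        """Run the kernel once; while untested configurations remain,
        this call doubles as a tuning step."""
        if self.pending:
            config = self.pending.pop(0)
            start = time.perf_counter()
            result = self.kernel(data, **config)
            self.timings.append((time.perf_counter() - start, config))
            return result
        return self.kernel(data, **self.best())


def blocked_sum(data, block_size, use_builtin):
    """Stand-in 'kernel': sums data block by block, two code variants."""
    total = 0
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        if use_builtin:
            total += sum(block)
        else:
            for x in block:
                total += x
    return total


if __name__ == "__main__":
    tuner = DynamicTuner(blocked_sum, {"block_size": [16, 256],
                                       "use_builtin": [True, False]})
    data = list(range(1000))
    for _ in range(6):  # first 4 calls explore, later calls reuse the best
        tuner.run(data)
    print("fastest configuration:", tuner.best())
```

Calling `reset()` restarts the search, which loosely mirrors the re-tuning "whenever needed" that the abstract mentions. A real tuner would typically also handle measurement noise and might use a smarter search than exhaustive enumeration.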

Code & Models

Repositories

Fillo7/KTT (official)
