Performance portability through machine learning guided kernel selection in SYCL libraries
John Lawson

TL;DR
This paper presents a machine learning approach to select and deploy a limited set of optimized kernels in SYCL libraries, enabling performance portability across diverse hardware and input scenarios without manual tuning.
Contribution
It introduces an automated method using clustering and classification to choose kernel subsets for deployment, reducing library size and tuning effort.
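The subset-selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. Benchmark data is assumed to be a table of runtimes, with one row per input configuration and one column per candidate kernel. Each row is normalized to relative performance (fastest kernel = 1.0), and the rows are grouped with plain k-means. Each cluster then contributes the kernel with the highest mean relative performance. The function names (`normalize`, `kmeans`, `select_kernels`) and the toy runtimes are hypothetical.

```python
import random
from typing import List, Sequence


def normalize(runtimes: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn each input's runtimes into relative performance (fastest kernel = 1.0)."""
    return [[min(row) / t for t in row] for row in runtimes]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], k: int, iters: int = 50, seed: int = 0) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members (keep it if the cluster empties).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def select_kernels(runtimes: Sequence[Sequence[float]], k: int, seed: int = 0) -> List[int]:
    """Hypothetical subset selection: one representative kernel per performance cluster."""
    perf = normalize(runtimes)
    labels = kmeans(perf, k, seed=seed)
    chosen = set()
    for c in set(labels):
        members = [p for p, lab in zip(perf, labels) if lab == c]
        n_kernels = len(members[0])
        # Kernel with the best mean relative performance over this cluster's inputs.
        best = max(range(n_kernels), key=lambda j: sum(m[j] for m in members) / len(members))
        chosen.add(best)
    return sorted(chosen)


# Illustrative toy data, not measured results.
runtimes = [
    [1.0, 2.0, 4.0],  # small inputs: kernel 0 is fastest
    [1.1, 2.1, 4.2],
    [4.0, 2.0, 1.0],  # large inputs: kernel 2 is fastest
    [4.2, 2.2, 1.0],
]
print(select_kernels(runtimes, k=2))  # -> [0, 2]
```

Normalizing each row before clustering groups inputs by which kernels perform well on them rather than by absolute runtime. Otherwise large problems would dominate the distance computation.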
Findings
Unsupervised clustering effectively identifies kernel subsets for deployment.
Simple classifiers can select the best kernel at runtime for diverse inputs.
The approach requires no developer intervention, only benchmark data.
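Runtime selection among the deployed kernels can likewise be sketched. The summary does not say which simple classifiers are used. This example assumes a 1-nearest-neighbour classifier over log-scaled problem sizes, trained by labelling each benchmarked input with its fastest *deployed* kernel. `NearestNeighbourSelector`, `features` and `best_deployed` are hypothetical names, and the shapes and runtimes are made-up illustrations.

```python
import math
from typing import List, Sequence, Tuple


def best_deployed(runtimes: Sequence[float], deployed: Sequence[int]) -> int:
    """Training label: the fastest kernel among those actually shipped."""
    return min(deployed, key=lambda j: runtimes[j])


def features(shape: Sequence[int]) -> List[float]:
    # Problem sizes span orders of magnitude, so compare them on a log scale.
    return [math.log2(s) for s in shape]


class NearestNeighbourSelector:
    """Hypothetical 1-NN kernel selector built from benchmark data only."""

    def __init__(self, shapes, runtimes, deployed):
        self.samples: List[Tuple[List[float], int]] = [
            (features(s), best_deployed(r, deployed)) for s, r in zip(shapes, runtimes)
        ]

    def select(self, shape: Sequence[int]) -> int:
        x = features(shape)
        # Return the label of the closest benchmarked input.
        _, label = min(
            self.samples,
            key=lambda sample: sum((a - b) ** 2 for a, b in zip(x, sample[0])),
        )
        return label


# Illustrative toy data: (m, n, k) shapes and runtimes for three candidate kernels.
shapes = [(64, 64, 64), (128, 128, 128), (2048, 2048, 2048), (4096, 4096, 4096)]
runtimes = [[1.0, 2.0, 4.0], [1.1, 2.1, 4.2], [4.0, 2.0, 1.0], [4.2, 2.2, 1.0]]
selector = NearestNeighbourSelector(shapes, runtimes, deployed=[0, 2])
print(selector.select((96, 96, 96)))        # -> 0
print(selector.select((3000, 3000, 3000)))  # -> 2
```

A nearest-neighbour model needs no fitting beyond storing the labelled benchmark table, which keeps the sketch short. A decision tree or similar compact classifier would be a natural substitute if lookup cost or model size inside the library mattered.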
Abstract
Automatically tuning parallel compute kernels allows libraries and frameworks to achieve performance on a wide range of hardware; however, these techniques typically focus on finding optimal kernel parameters for particular input sizes and parameters. General-purpose compute libraries must cater to all inputs and parameters provided by a user, so these techniques are of limited use. Additionally, parallel programming frameworks such as SYCL require that kernels be deployed in a binary format embedded within the library. As such, it is impractical to deploy a large number of possible kernel configurations without inflating the library size. Machine learning methods can mitigate both of these problems and provide performance for general-purpose routines with a limited number of kernel configurations. We show that unsupervised clustering methods…
