LS-CAT: A Large-Scale CUDA AutoTuning Dataset
Lars Bjertnes, Jacob O. Tørring, Anne C. Elster

TL;DR
This paper introduces LS-CAT, a large dataset of CUDA kernels sourced from GitHub together with their benchmarked runtimes. It is built to help machine learning models predict optimal GPU kernel configurations, such as thread block size, where a good choice can noticeably improve performance.
Contribution
We present LS-CAT, a comprehensive CUDA auto-tuning dataset with nearly 20,000 kernels and over 5 million runtimes, enabling ML models to predict optimal thread block sizes.
Findings
Optimal thread block size improves performance by 6% on average.
In 10% of cases, performance increases exceed 20%.
Dataset supports training NLP models for code optimization.
Abstract
The effectiveness of Machine Learning (ML) methods depends on access to large, suitable datasets. In this article, we present how we built the LS-CAT (Large-Scale CUDA AutoTuning) dataset, sourced from GitHub, for the purpose of training NLP-based ML models. Our dataset includes 19 683 CUDA kernels focused on linear algebra. In addition to the CUDA codes, our LS-CAT dataset contains 5 028 536 associated runtimes, covering different combinations of kernels, block sizes and matrix sizes. The runtimes are GPU benchmarks on both Nvidia GTX 980 and Nvidia T4 systems. This information creates a foundation upon which NLP-based models can find correlations between source-code features and the optimal choice of thread block size. Several results can be drawn from our LS-CAT database. For example, our experimental results show that an optimal choice in thread block size can gain an average of 6%…
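As a rough illustration of how runtimes indexed by kernel, block size and matrix size could be turned into "optimal thread block size" labels, here is a minimal Python sketch. The record layout, the names `records` and `best_block_sizes`, the toy numbers, and the choice of the median runtime as the comparison baseline are all assumptions made for this example; they are not taken from the LS-CAT dataset or its tooling.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (kernel_id, matrix_size, block_size, runtime_ms).
# These values are illustrative only and are NOT taken from LS-CAT.
records = [
    ("k0", 512, 64, 1.20),
    ("k0", 512, 128, 1.00),
    ("k0", 512, 256, 1.10),
    ("k1", 512, 64, 2.00),
    ("k1", 512, 128, 2.50),
    ("k1", 512, 256, 2.20),
]


def best_block_sizes(rows):
    """For each (kernel, matrix size), find the fastest block size.

    Returns a dict mapping (kernel_id, matrix_size) to a tuple
    (best_block_size, best_runtime, gain), where gain is the median
    runtime divided by the best runtime (an assumed baseline).
    """
    groups = defaultdict(list)
    for kernel_id, matrix_size, block_size, runtime in rows:
        groups[(kernel_id, matrix_size)].append((runtime, block_size))

    result = {}
    for key, runs in groups.items():
        # min() on (runtime, block_size) tuples picks the lowest runtime.
        best_runtime, best_block = min(runs)
        baseline = median(runtime for runtime, _ in runs)
        result[key] = (best_block, best_runtime, baseline / best_runtime)
    return result


best = best_block_sizes(records)
for (kernel_id, matrix_size), (block, runtime, gain) in sorted(best.items()):
    print(f"{kernel_id} n={matrix_size}: block={block}, "
          f"runtime={runtime:.2f} ms, gain over median={gain:.2f}x")
```

Labels of this kind (the best block size per kernel and matrix size) are what a source-code model would then be trained to predict; the baseline used to quantify the gain is a design choice and may differ from the one used in the paper.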
