GPU Performance Portability needs Autotuning
Burkhard Ringlein, Thomas Parnell, Radu Stoica

TL;DR
This paper advocates combining JIT compilation with autotuning to achieve portable, high-performance LLM inference across different GPU hardware, reducing manual optimization and vendor lock-in.
Contribution
It makes the case for pairing JIT compilation with comprehensive kernel parameter autotuning, so that performance-critical LLM kernels run efficiently on diverse GPU platforms without code changes.
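To make the idea of kernel parameter autotuning concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the toy kernel `blocked_sum`, its parameters `block_size` and `unroll`, and the helper `autotune` are all hypothetical names chosen for illustration, and CPU wall-clock timing stands in for GPU kernel benchmarking. The sketch only shows the general pattern: enumerate a configuration space, time each candidate on the target machine, and keep the fastest.

```python
import itertools
import time


def blocked_sum(data, block_size, unroll):
    """Toy stand-in for a kernel: the parameters change how work is split, not the result."""
    total = 0
    step = block_size * unroll
    for start in range(0, len(data), step):
        for u in range(unroll):
            lo = start + u * block_size
            total += sum(data[lo:lo + block_size])  # empty slice past the end adds 0
    return total


def autotune(kernel, args, search_space, repeats=3):
    """Exhaustively time every configuration and return the fastest one (hypothetical helper)."""
    names = list(search_space)
    best_config, best_time = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        timings = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(*args, **config)
            timings.append(time.perf_counter() - t0)
        elapsed = min(timings)  # best-of-N reduces timing noise
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time


# Assumed search space: 4 block sizes x 3 unroll factors = 12 candidates.
space = {"block_size": [16, 64, 256, 1024], "unroll": [1, 2, 4]}
data = list(range(50_000))
best, seconds = autotune(blocked_sum, (data,), space)
print("fastest configuration on this machine:", best)
```

Because the winning configuration is measured rather than hard-coded, the same source can settle on different parameters on different hardware, which is the portability argument in miniature. A real system would tune JIT-compiled GPU kernels and cache the result per device and problem shape.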
Findings
Explores up to 15x more kernel parameter configurations.
Outperforms vendor-optimized implementations by up to 230%.
Reduces kernel code size by 70x.
Abstract
As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Advanced Data Storage Technologies
Methods: Softmax · Attention Is All You Need
