Pushing the Limits of Online Auto-tuning: Machine Code Optimization in Short-Running Kernels
Fernando Endo, Damien Couroussé, Henri-Pierre Charles

TL;DR
This paper introduces an online auto-tuning method for short-running kernels that works directly at the level of machine code generation, achieving average speedups of 1.04 to 1.58 (run-time overheads included) in applications that run for only hundreds of milliseconds to a few seconds.
Contribution
It presents an approach that deploys auto-tuning at the level of machine code generation, rather than through a long source-to-binary compilation chain, so that optimization pays off in applications lasting from hundreds of milliseconds to a few seconds.
Findings
Average speedups of 1.10 to 1.58 in a CPU-bound kernel, depending on the target micro-architecture
Up to 2.53 speedup in the most favourable conditions, all run-time overheads included
Overhead of 0.2% to 4.2% of total execution time
Abstract
We propose an online auto-tuning approach for computing kernels. Unlike existing online auto-tuners, which regenerate code with long compilation chains from the source to the binary code, our approach consists of deploying auto-tuning directly at the level of machine code generation. This allows auto-tuning to pay off in very short-running applications. As a proof of concept, our approach is demonstrated in two benchmarks, which run for hundreds of milliseconds to a few seconds only. In a CPU-bound kernel, the average speedups achieved are 1.10 to 1.58 depending on the target micro-architecture, up to 2.53 in the most favourable conditions (all run-time overheads included). In a memory-bound kernel, less favourable to our runtime auto-tuning optimizations, the average speedups are 1.04 to 1.10, up to 1.30 in the best configuration. Despite the short execution times of…
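The general idea of online auto-tuning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: their tool regenerates machine code at run time, whereas this sketch only mimics the control flow in Python. The names `make_dot_variant`, `OnlineTuner`, and the unroll-factor parameter are hypothetical and chosen for illustration. Each of the first few kernel calls runs a freshly generated variant on real data and times it; once every candidate has been tried, later calls reuse the fastest one, so exploration cost is paid inside the application's own run time.

```python
import time


def make_dot_variant(unroll):
    """Generate a dot-product kernel specialized for a (hypothetical) unroll factor."""
    def dot(a, b):
        n = len(a)
        total = 0.0
        i = 0
        limit = n - n % unroll          # part of the vector covered by full unrolled steps
        while i < limit:
            for j in range(unroll):     # stands in for an unrolled loop body
                total += a[i + j] * b[i + j]
            i += unroll
        while i < n:                    # leftover elements
            total += a[i] * b[i]
            i += 1
        return total
    return dot


class OnlineTuner:
    """Explore kernel variants during real calls, then keep the fastest one."""

    def __init__(self, params, generate, clock=time.perf_counter):
        self.pending = list(params)     # tuning parameters not yet evaluated
        self.generate = generate        # run-time "code generator" (assumed interface)
        self.clock = clock
        self.best_param = None
        self.best_time = float("inf")
        self.best_fn = None

    def call(self, *args):
        if self.pending:
            # Exploration: generate a new variant and time it on useful work.
            param = self.pending.pop(0)
            fn = self.generate(param)
            start = self.clock()
            result = fn(*args)
            elapsed = self.clock() - start
            if elapsed < self.best_time:
                self.best_param, self.best_time, self.best_fn = param, elapsed, fn
            return result
        # Exploitation: all variants evaluated, reuse the best one.
        return self.best_fn(*args)
```

A caller would create `OnlineTuner([1, 2, 4], make_dot_variant)` and invoke `tuner.call(a, b)` wherever the kernel is needed. The `clock` argument is injectable only so that the selection logic can be checked deterministically; a single timing per variant is a simplification, and a real tuner would need to handle measurement noise.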
