Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance
Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, Yibo Zhu

TL;DR
Bolt introduces a novel approach that combines auto-tuners and hardware-native libraries, leveraging modular vendor libraries to significantly improve tensor operation performance and auto-tuning efficiency on GPUs.
Contribution
Bolt presents a hardware-native templated search method that bridges the gap between auto-tuners and vendor libraries, enabling faster and more efficient tensor program optimization.
Findings
Improves inference speed of common convolutional neural networks by 2.5x on average over the state of the art
Auto-tunes these models within 20 minutes
Demonstrates effectiveness on NVIDIA GPUs in production environments
Abstract
Today's auto-tuners (e.g., AutoTVM, Ansor) generate efficient tensor programs by navigating a large search space to identify effective implementations, but they do so with opaque hardware details. Thus, their performance could fall behind that of hardware-native libraries (e.g., cuBLAS, cuDNN), which are hand-optimized by device vendors to extract high performance. On the other hand, these vendor libraries have a fixed set of supported functions and lack the customization and automation support afforded by auto-tuners. Bolt is based on the recent trend that vendor libraries are increasingly modularized and reconfigurable via declarative control (e.g., CUTLASS). It enables a novel approach that bridges this gap and achieves the best of both worlds, via hardware-native templated search. Bolt provides new opportunities to rethink end-to-end tensor optimizations at the graph, operator, and…
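To make the idea of a templated search concrete, here is a minimal Python sketch. It is not Bolt's implementation and does not call CUTLASS. The `GemmTemplate` parameters, the 48 KiB shared-memory budget, the fp16 element size, and the `toy_cost` function are all assumptions chosen for illustration. The sketch enumerates a small set of declarative template parameters, prunes those that violate a hardware constraint, and keeps the candidate with the lowest measured cost.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class GemmTemplate:
    """Hypothetical parameter set for a CUTLASS-style GEMM template."""
    tile_m: int
    tile_n: int
    tile_k: int
    stages: int


SMEM_LIMIT_BYTES = 48 * 1024  # assumed per-block shared-memory budget
ELEM_BYTES = 2                # assumed fp16 operands


def ceil_div(a, b):
    return -(-a // b)


def smem_bytes(t):
    # One A tile and one B tile are buffered for each pipeline stage.
    return t.stages * (t.tile_m * t.tile_k + t.tile_k * t.tile_n) * ELEM_BYTES


def candidates():
    """Enumerate template instances that fit the assumed hardware limit."""
    for m, n, k, s in product((64, 128, 256), (64, 128, 256), (32, 64), (2, 3)):
        t = GemmTemplate(m, n, k, s)
        if smem_bytes(t) <= SMEM_LIMIT_BYTES:
            yield t


def toy_cost(t, M, N, K):
    """Stand-in for on-device profiling; NOT a real performance model."""
    blocks = ceil_div(M, t.tile_m) * ceil_div(N, t.tile_n)
    padded_work = blocks * t.tile_m * t.tile_n * K           # includes padding waste
    traffic = blocks * K * (t.tile_m + t.tile_n) * ELEM_BYTES  # operand bytes loaded
    return padded_work + 8 * traffic


def search(M, N, K, measure=toy_cost):
    """Return the feasible template with the lowest measured cost."""
    return min(candidates(), key=lambda t: measure(t, M, N, K))


if __name__ == "__main__":
    best = search(1024, 1024, 512)
    print(best, smem_bytes(best))
```

In a real system the `measure` argument would presumably compile each instantiated template and time it on the target GPU instead of using an analytical stand-in. Because the search space is limited to parameters the vendor templates already expose, it is far smaller than the space a general auto-tuner explores, which is consistent with the short tuning time reported above.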
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
