Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance

Jiarong Xing; Leyuan Wang; Shang Zhang; Jack Chen; Ang Chen; Yibo Zhu

arXiv:2110.15238 · cs.DC · October 29, 2021 · 21 cites

TL;DR

Bolt introduces a novel approach that combines auto-tuners and hardware-native libraries, leveraging modular vendor libraries to significantly improve tensor operation performance and auto-tuning efficiency on GPUs.

Contribution

Bolt presents a hardware-native templated search method that bridges the gap between auto-tuners and vendor libraries, enabling faster and more efficient tensor program optimization.

Findings

01 Speeds up inference of common CNNs by 2.5x on average over the state of the art

02 Auto-tunes these models within 20 minutes

03 Demonstrates effectiveness on NVIDIA GPUs in production environments

Abstract

Today's auto-tuners (e.g., AutoTVM, Ansor) generate efficient tensor programs by navigating a large search space to identify effective implementations, but they do so with opaque hardware details. Thus, their performance could fall behind that of hardware-native libraries (e.g., cuBLAS, cuDNN), which are hand-optimized by device vendors to extract high performance. On the other hand, these vendor libraries have a fixed set of supported functions and lack the customization and automation support afforded by auto-tuners. Bolt is based on the recent trend that vendor libraries are increasingly modularized and reconfigurable via declarative control (e.g., CUTLASS). It enables a novel approach that bridges this gap and achieves the best of both worlds, via hardware-native templated search. Bolt provides new opportunities to rethink end-to-end tensor optimizations at the graph, operator, and…
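
As a rough illustration of what a "templated search" might look like, the sketch below enumerates parameters of a tiled GEMM template, discards configurations that violate a hardware constraint, and keeps the cheapest survivor. It is a toy under stated assumptions, not Bolt's implementation: the candidate tile sizes, the 48 KiB shared-memory budget, the FP16 operand size, and the `toy_cost` function (a stand-in for profiling a real kernel) are all hypothetical and do not come from the paper or from CUTLASS.

```python
from itertools import product

# Hypothetical candidate values for a tiled GEMM template (not from the paper).
TILE_M = (64, 128, 256)
TILE_N = (64, 128, 256)
TILE_K = (32, 64)

SHARED_MEM_BYTES = 48 * 1024  # assumed per-block shared-memory budget
BYTES_PER_ELEMENT = 2         # assumed FP16 operands


def fits_in_shared_memory(tm, tn, tk):
    """Keep only tiles whose A and B operand tiles fit the assumed budget."""
    return (tm * tk + tk * tn) * BYTES_PER_ELEMENT <= SHARED_MEM_BYTES


def toy_cost(m, n, k, tm, tn, tk):
    """Placeholder for a profiler run: padded work divided by a reuse proxy."""
    blocks_m = -(-m // tm)  # ceiling division
    blocks_n = -(-n // tn)
    padded_work = blocks_m * tm * blocks_n * tn * k
    reuse = (tm * tn) / (tm + tn)  # larger tiles reuse loaded operands more
    return padded_work / reuse


def search(m, n, k):
    """Enumerate template parameters, filter them, and return the cheapest."""
    candidates = [
        c for c in product(TILE_M, TILE_N, TILE_K) if fits_in_shared_memory(*c)
    ]
    return min(candidates, key=lambda c: toy_cost(m, n, k, *c))
```

Calling `search(1024, 1024, 1024)` returns a `(tile_m, tile_n, tile_k)` tuple. In a real system the cost function would be replaced by measurements of generated kernels on the target GPU; the point here is only that the search space is a small set of template parameters rather than an unconstrained program space.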

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings