SpeedLimit: Neural Architecture Search for Quantized Transformer Models

Yuji Chai; Luke Bailey; Yunho Jin; Matthew Karle; Glenn G. Ko; David Brooks; Gu-Yeon Wei; H. T. Kung

arXiv:2209.12127 · cs.LG · October 16, 2023

TL;DR

SpeedLimit is a NAS method that optimizes transformer models for accuracy while respecting latency constraints, incorporating 8-bit quantization to outperform existing techniques in latency-sensitive applications.

Contribution

Introduces a NAS approach that jointly optimizes accuracy and latency for quantized transformer models, integrating 8-bit quantization during search.

Findings

01. Outperforms state-of-the-art latency-constrained models

02. Demonstrates effective balance between accuracy and inference speed

03. Validates the use of 8-bit quantization in NAS for transformers
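
The summary does not say which 8-bit quantization scheme is used, so the following is only a generic illustration of what 8-bit integer quantization of weights can look like. It assumes symmetric per-tensor quantization with a single scale factor; the function names `quantize_int8` and `dequantize` and the example weights are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the range [-127, 127].

    Assumption: one scale for the whole tensor, chosen so that the
    largest absolute value maps to 127.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Hypothetical weights, for illustration only.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the rounding error that a quantization-aware search has to account for when it estimates accuracy.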

Abstract

While research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry often necessitate a rigorous consideration of inference latency constraints. Addressing this challenge, we introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint. Our method incorporates 8-bit integer quantization in the search process to outperform the current state-of-the-art technique. Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments.
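
The core idea in the abstract, maximizing accuracy subject to an upper bound on latency, can be sketched as a constrained selection over candidate architectures. This is a minimal sketch, not the authors' search procedure: the `Candidate` type, the `best_under_latency` helper, and all accuracy and latency figures are hypothetical, and a real search would measure both quantities on quantized models rather than read them from a list.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate architecture with measured metrics."""
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # measured inference latency


def best_under_latency(
    candidates: Iterable[Candidate], max_latency_ms: float
) -> Optional[Candidate]:
    """Return the most accurate candidate whose latency does not exceed
    the bound, or None if no candidate is feasible."""
    feasible = [c for c in candidates if c.latency_ms <= max_latency_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.accuracy)


# Made-up numbers, for illustration only.
candidates = [
    Candidate("small", accuracy=0.80, latency_ms=5.0),
    Candidate("medium", accuracy=0.85, latency_ms=9.0),
    Candidate("large", accuracy=0.90, latency_ms=20.0),
]
choice = best_under_latency(candidates, max_latency_ms=10.0)
```

With a 10 ms bound, the most accurate candidate ("large") is infeasible, so the selection falls back to "medium". Treating latency as a hard constraint rather than a weighted term in the objective matches the "upper-bound latency constraint" wording of the abstract.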

Taxonomy

Topics: Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dropout · Dense Connections · Residual Connection · Weight Decay · Attention Dropout