SpeedLimit: Neural Architecture Search for Quantized Transformer Models
Yuji Chai, Luke Bailey, Yunho Jin, Matthew Karle, Glenn G. Ko, David Brooks, Gu-Yeon Wei, H. T. Kung

TL;DR
SpeedLimit is a NAS method that optimizes transformer models for accuracy while respecting latency constraints, incorporating 8-bit quantization to outperform existing techniques in latency-sensitive applications.
Contribution
Introduces a NAS approach that jointly optimizes accuracy and latency for quantized transformer models, integrating 8-bit quantization during search.
Findings
Outperforms state-of-the-art latency-constrained models
Demonstrates effective balance between accuracy and inference speed
Validates the use of 8-bit quantization in NAS for transformers
Abstract
While research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry often necessitate a rigorous consideration of inference latency constraints. Addressing this challenge, we introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint. Our method incorporates 8-bit integer quantization in the search process to outperform the current state-of-the-art technique. Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments.
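The idea of optimizing accuracy under an upper-bound latency constraint can be illustrated with a minimal sketch. This is not the authors' search algorithm: the `Candidate` fields, the two proxy functions, and the exhaustive filter-then-maximize strategy are all assumptions made for illustration. In a real setting the latency would be measured on the 8-bit quantized model and the accuracy would come from training or evaluating each candidate.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Candidate:
    """Hypothetical architecture description (not taken from the paper)."""
    num_layers: int
    hidden_size: int


def estimated_latency_ms(c: Candidate) -> float:
    # Assumed proxy: cost grows with depth and with the square of the width.
    # Stands in for measuring the int8-quantized model on target hardware.
    return 1e-5 * c.num_layers * c.hidden_size ** 2


def estimated_accuracy(c: Candidate) -> float:
    # Assumed proxy: larger models score higher, with diminishing returns.
    capacity = c.num_layers * c.hidden_size ** 0.5
    return 1.0 - 1.0 / (1.0 + 0.01 * capacity)


def constrained_search(candidates, latency_budget_ms):
    """Return the most accurate candidate whose latency fits the budget."""
    feasible = [c for c in candidates
                if estimated_latency_ms(c) <= latency_budget_ms]
    if not feasible:
        return None  # no architecture satisfies the latency bound
    return max(feasible, key=estimated_accuracy)


# Small illustrative search space.
search_space = [Candidate(layers, hidden)
                for layers, hidden in product((2, 4, 6, 12), (128, 256, 512, 768))]

best = constrained_search(search_space, latency_budget_ms=10.0)
```

The latency bound acts as a hard filter, so the search never trades a budget violation for extra accuracy. A practical NAS method would replace the exhaustive loop with a guided search, since evaluating every candidate is rarely affordable.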
Taxonomy
Topics: Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dropout · Dense Connections · Residual Connection · Weight Decay · Attention Dropout
