HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

Hanrui Wang; Zhanghao Wu; Zhijian Liu; Han Cai; Ligeng Zhu; Chuang Gan; Song Han

arXiv:2005.14187·cs.CL·April 5, 2024·21 cites

TL;DR

This paper introduces Hardware-Aware Transformers (HAT), a neural architecture search method that designs efficient, hardware-specific transformer models for NLP tasks, significantly improving speed and size on resource-constrained devices.

Contribution

HAT employs neural architecture search with a large design space and evolutionary algorithms to create hardware-optimized transformer models for diverse hardware platforms.

Findings

01  3x speedup over the baseline Transformer on Raspberry Pi-4

02  3.7x smaller model size than the baseline Transformer

03  12,041x lower search cost than Evolved Transformer, with no performance loss

Abstract

Transformers are ubiquitous in Natural Language Processing (NLP) tasks, but they are difficult to deploy on hardware because of their intensive computation. To enable low-latency inference on resource-constrained hardware platforms, we propose to design Hardware-Aware Transformers (HAT) with neural architecture search. We first construct a large design space with arbitrary encoder-decoder attention and heterogeneous layers. Then we train a SuperTransformer that covers all candidates in the design space and efficiently produces many SubTransformers with weight sharing. Finally, we perform an evolutionary search with a hardware latency constraint to find a specialized SubTransformer dedicated to running fast on the target hardware. Extensive experiments on four machine translation tasks demonstrate that HAT can discover efficient models…
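
The search step described in the abstract can be illustrated with a minimal sketch of an evolutionary search under a latency constraint. Everything named here is a hypothetical stand-in, not the authors' code: the tiny SPACE dictionary, the analytic latency() model, and the score() proxy are assumptions for illustration. In HAT itself, candidates would be SubTransformers evaluated with weights inherited from the SuperTransformer, and latency would reflect the target hardware.

```python
import random

# Hypothetical toy design space (not the paper's actual choices).
SPACE = {
    "embed_dim": [512, 640],
    "ffn_dim": [1024, 2048, 3072],
    "num_layers": [1, 2, 3, 4, 5, 6],
}

def sample(rng):
    """Draw one random architecture configuration from SPACE."""
    return {k: rng.choice(v) for k, v in SPACE.items()}

def latency(cfg):
    """Stand-in latency model (assumption): grows with depth and width."""
    return cfg["num_layers"] * (cfg["embed_dim"] + cfg["ffn_dim"]) / 1000.0

def score(cfg):
    """Stand-in quality proxy (assumption): larger models score higher."""
    return cfg["num_layers"] + cfg["embed_dim"] / 640 + cfg["ffn_dim"] / 3072

def mutate(cfg, rng, prob=0.3):
    """Resample each choice independently with probability `prob`."""
    child = dict(cfg)
    for key, choices in SPACE.items():
        if rng.random() < prob:
            child[key] = rng.choice(choices)
    return child

def crossover(a, b, rng):
    """Pick each choice from one of the two parents at random."""
    return {k: rng.choice([a[k], b[k]]) for k in SPACE}

def evolutionary_search(max_latency, population=20, parents=5,
                        iterations=15, seed=0):
    rng = random.Random(seed)
    # latency() is monotone in every choice, so the smallest config
    # gives the lowest achievable latency in this toy space.
    smallest = {k: min(v) for k, v in SPACE.items()}
    if latency(smallest) > max_latency:
        raise ValueError("no configuration satisfies the latency constraint")

    def feasible(cfg):
        return latency(cfg) <= max_latency

    # Initial population: rejection-sample feasible configurations.
    pop = []
    while len(pop) < population:
        cfg = sample(rng)
        if feasible(cfg):
            pop.append(cfg)

    for _ in range(iterations):
        # Keep the best candidates as parents for the next generation.
        pop.sort(key=score, reverse=True)
        top = pop[:parents]
        children = []
        while len(children) < population - parents:
            if rng.random() < 0.5:
                child = mutate(rng.choice(top), rng)
            else:
                child = crossover(rng.choice(top), rng.choice(top), rng)
            # Discard children that violate the latency constraint.
            if feasible(child):
                children.append(child)
        pop = top + children

    return max(pop, key=score)

if __name__ == "__main__":
    best = evolutionary_search(max_latency=8.0)
    print(best, latency(best))
```

Tightening max_latency pushes the search toward shallower or narrower configurations, which is the sense in which the result is specialized to a latency budget. The constraint is enforced by rejecting infeasible candidates rather than by penalizing the score; that is a design choice of this sketch, not a claim about the paper's implementation.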

Code & Models

Videos

[ACL 2020] HAT: Hardware-Aware Transformers for Efficient Natural Language Processing · youtube

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Adversarial Robustness in Machine Learning

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding