HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression

Jiaqi Gu; Ben Keller; Jean Kossaifi; Anima Anandkumar; Brucek Khailany; David Z. Pan

arXiv:2211.16749 · cs.LG · December 1, 2022 · 1 citation


TL;DR

This paper introduces HEAT, a hardware-aware tensor decomposition framework that automates transformer compression, reducing the energy-delay product by 5.7x with less than 1.1% accuracy loss.

Contribution

HEAT provides an automated, hardware-aware tensor decomposition method for transformer compression that jointly selects tensorization shapes and decomposition ranks to improve hardware efficiency.
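As a rough illustration of the search space described here, this sketch enumerates tensorization shapes for a single linear layer and counts the parameters of a tensor-train-matrix (TTM) factorization at a fixed rank. The TTM format, the uniform internal rank, the helper names (`factorizations`, `ttm_params`, `smallest_ttm`), and the use of parameter count as the cost are all assumptions for illustration; HEAT's actual objective is hardware-aware and is not reproduced here.

```python
def factorizations(n, k, min_factor=2):
    """Yield every ordered k-tuple of integers >= min_factor whose product is n."""
    if k == 1:
        if n >= min_factor:
            yield (n,)
        return
    for f in range(min_factor, n + 1):
        if n % f == 0:
            for rest in factorizations(n // f, k - 1, min_factor):
                yield (f,) + rest


def ttm_params(in_shape, out_shape, rank):
    """Parameters in a TT-matrix whose i-th core has shape (r_{i-1}, m_i, n_i, r_i).

    Boundary ranks are 1 and every internal rank equals `rank` (a simplifying assumption).
    """
    d = len(in_shape)
    ranks = [1] + [rank] * (d - 1) + [1]
    return sum(ranks[i] * in_shape[i] * out_shape[i] * ranks[i + 1] for i in range(d))


def smallest_ttm(in_features, out_features, num_cores, rank):
    """Brute-force search over tensorization shapes, using parameter count as the cost.

    Hypothetical stand-in: HEAT co-optimizes against a hardware cost, not parameter count.
    """
    best = None
    for in_shape in factorizations(in_features, num_cores):
        for out_shape in factorizations(out_features, num_cores):
            cost = ttm_params(in_shape, out_shape, rank)
            if best is None or cost < best[0]:
                best = (cost, in_shape, out_shape)
    return best


if __name__ == "__main__":
    # 768 -> 3072 is the BERT-base feed-forward projection, used here only as an example size.
    in_f, out_f = 768, 3072
    cost, in_shape, out_shape = smallest_ttm(in_f, out_f, num_cores=3, rank=16)
    print(f"dense: {in_f * out_f} params")
    print(f"TTM {in_shape} x {out_shape}, rank 16: {cost} params")
```

Even at three cores the number of (input shape, output shape) pairs grows quickly with the number of divisors of each dimension, which is why an automated search is attractive compared with hand-picking shapes.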

Findings

01

Reduces energy-delay product by 5.7x

02

Less than 1.1% accuracy loss

03

Outperforms hand-tuned baselines
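Energy-delay product (EDP) multiplies energy per inference by latency, so an EDP reduction factor is the product of the energy and latency reduction factors. This page does not state which reference model the 5.7x figure is measured against; the subscript "ref" stands for it.

```latex
\mathrm{EDP} = E \cdot D, \qquad
\frac{\mathrm{EDP}_{\mathrm{ref}}}{\mathrm{EDP}_{\mathrm{HEAT}}}
  = \frac{E_{\mathrm{ref}}}{E_{\mathrm{HEAT}}} \cdot \frac{D_{\mathrm{ref}}}{D_{\mathrm{HEAT}}}
  = 5.7
```

For example, a hypothetical 2x energy saving combined with a 2.85x latency saving would give 2 × 2.85 = 5.7. This split is purely illustrative and is not the paper's breakdown.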

Abstract

Transformers have attained superior performance in natural language processing and computer vision. Their self-attention and feedforward layers are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by leveraging tensor algebraic properties to express the parameters in a factorized form. Prior efforts used manual or heuristic factorization settings without hardware-aware customization, resulting in poor hardware efficiencies and large performance degradation. In this work, we propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions and automates the choice of tensorization shape and decomposition rank with hardware-aware co-optimization. We jointly investigate tensor contraction path…
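The abstract breaks off while discussing tensor contraction paths. In a factorized layer, the order in which the input and the cores are contracted changes the FLOP count and the size of intermediate tensors but not the result. The following is a minimal sketch using NumPy's `np.einsum_path` on a hypothetical three-core TT-matrix layer (`make_ttm_cores`, `ttm_forward`, and `dense_forward` are illustrative names). It relies on NumPy's greedy path optimizer rather than HEAT's own hardware-aware search.

```python
import numpy as np


def make_ttm_cores(in_shape, out_shape, rank, rng):
    """Random 3-core TT-matrix; the boundary ranks of 1 are dropped from the core shapes."""
    (m1, m2, m3), (n1, n2, n3) = in_shape, out_shape
    g1 = rng.standard_normal((m1, n1, rank))
    g2 = rng.standard_normal((rank, m2, n2, rank))
    g3 = rng.standard_normal((rank, m3, n3))
    return g1, g2, g3


def ttm_forward(x, cores, optimize="greedy"):
    """Apply the factorized layer directly to x of shape (batch, m1, m2, m3)."""
    g1, g2, g3 = cores
    expr = "bijk,iap,pjcq,qke->bace"  # i,j,k: input modes; a,c,e: output modes; p,q: ranks
    path, report = np.einsum_path(expr, x, g1, g2, g3, optimize=optimize)
    y = np.einsum(expr, x, g1, g2, g3, optimize=path)  # reuse the precomputed path
    return y, path, report


def dense_forward(x, cores):
    """Reference: rebuild the full weight matrix, then do an ordinary matmul."""
    g1, g2, g3 = cores
    w = np.einsum("iap,pjcq,qke->ijkace", g1, g2, g3, optimize=True)
    m = g1.shape[0] * g2.shape[1] * g3.shape[1]
    n = g1.shape[1] * g2.shape[2] * g3.shape[2]
    out = x.reshape(x.shape[0], m) @ w.reshape(m, n)
    return out.reshape(x.shape[0], g1.shape[1], g2.shape[2], g3.shape[2])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cores = make_ttm_cores((4, 4, 6), (8, 8, 6), rank=4, rng=rng)
    x = rng.standard_normal((2, 4, 4, 6))
    y, path, report = ttm_forward(x, cores)
    print(path)
    print(report)
    print(np.allclose(y, dense_forward(x, cores)))
```

`report` lists each pairwise contraction along with the estimated FLOP counts. Passing `optimize="optimal"` instead searches exhaustively, which is only practical for a handful of operands.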


Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Dense Connections · WordPiece · Residual Connection · Linear Warmup With Linear Decay · Layer Normalization