Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition

Yang Li; Liangzhen Lai; Yuan Shangguan; Forrest N. Iandola; Zhaoheng Ni; Ernie Chang; Yangyang Shi; Vikas Chandra

arXiv:2309.07988 · cs.LG · January 22, 2024 · 2 cites

TL;DR

This paper introduces folding attention, a technique that reduces the model size, memory footprint, and power consumption of on-device Transformer-based streaming speech recognition without affecting accuracy.

Contribution

Folding attention targets the linear projection layers of Transformer models rather than the attention score calculation, reducing model size, memory use, and power consumption for streaming speech recognition.

Findings

01. Model size reduced by up to 24%

02. Power consumption decreased by up to 23%

03. No loss in model accuracy and no added computational overhead

Abstract

Transformer-based models excel in speech recognition. Existing efforts to optimize Transformer inference, typically for long-context applications, center on simplifying attention score calculations. However, streaming speech recognition models usually process a limited number of tokens each time, making attention score calculation less of a bottleneck. Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, constituting a substantial portion of the model size and contributing significantly to computation, memory, and power usage. To address this bottleneck, we propose folding attention, a technique targeting these linear layers, significantly reducing model size and improving memory and power efficiency. Experiments on on-device Transformer-based streaming speech recognition models show that folding attention reduces model size…
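
The bottleneck argument in the abstract can be made concrete with a rough operation count. The sketch below is illustrative only: it does not implement folding attention, the function name `attention_layer_costs` is hypothetical, and the token counts and model width are assumed values, not figures from the paper. It compares the multiply-accumulate cost of the four linear projections in one multi-head attention block (query, key, value, output) with the cost of the attention score calculation, ignoring biases and the softmax.

```python
def attention_layer_costs(num_tokens: int, d_model: int) -> dict:
    """Rough multiply-accumulate (MAC) counts for one multi-head attention block.

    Assumes standard multi-head attention with four d_model x d_model
    projection matrices; biases, softmax, and normalization are ignored.
    """
    # Q, K, V, and output projections: each token passes through four
    # d_model x d_model matrices.
    projection_macs = 4 * num_tokens * d_model * d_model
    # Q K^T plus the weighted sum over V; summed over heads this is
    # 2 * T^2 * d_model regardless of how many heads there are.
    score_macs = 2 * num_tokens * num_tokens * d_model
    # Weights stored for the four projections; the score calculation
    # itself has no parameters.
    projection_params = 4 * d_model * d_model
    return {
        "projection_macs": projection_macs,
        "score_macs": score_macs,
        "projection_params": projection_params,
        "projection_to_score_ratio": projection_macs / score_macs,
    }


# Assumed, illustrative settings: a short streaming chunk versus a long context.
streaming = attention_layer_costs(num_tokens=32, d_model=512)
long_context = attention_layer_costs(num_tokens=4096, d_model=512)

print(streaming["projection_to_score_ratio"])     # 32.0
print(long_context["projection_to_score_ratio"])  # 0.25
```

Under these assumptions the ratio simplifies to 2 * d_model / num_tokens, so with few tokens per step the projections dominate the computation, while in a long-context setting the score calculation does. The feedforward network adds further linear-layer cost that this sketch leaves out.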

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Attention Is All You Need · Softmax · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Linear Layer · Residual Connection · Adam · Multi-Head Attention · Layer Normalization