SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

Titouan Parcollet; Rogier van Dalen; Shucong Zhang; Sourav Bhattacharya

arXiv:2307.07421·cs.CL·July 12, 2024·1 cites

TL;DR

SummaryMixing is a linear-time alternative to self-attention for speech recognition and understanding models. It matches or exceeds the accuracy of self-attention while making training and inference up to 28% faster and halving memory use.

Contribution

The paper introduces SummaryMixing, a linear-complexity token mixing method that preserves or exceeds the accuracy of self-attention in speech recognition tasks.

Findings

01. Up to 28% faster training and inference.

02. Memory usage reduced by half.

03. Maintains or exceeds state-of-the-art accuracy.

Abstract

Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method "SummaryMixing". Introducing SummaryMixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while making training and inference up to 28% faster and reducing memory use by half.
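The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract only states that the utterance is summarised by a mean over all time steps and that this single summary is combined with time-specific information. The per-step projections, the `tanh` nonlinearity, the concatenation used to combine the two branches, and the names `summary_mixing`, `w_local`, `w_summary`, and `w_combine` are all assumptions made for illustration.

```python
import numpy as np


def summary_mixing(x, w_local, w_summary, w_combine):
    """Illustrative token mixing with cost linear in the number of steps T.

    x: (T, D) array, one feature vector per time step.
    Returns a (T, D_out) array.
    All projections and the tanh nonlinearity are assumptions of this sketch.
    """
    # Time-specific branch: each step is transformed independently.
    local = np.tanh(x @ w_local)                        # (T, H)
    # Summary branch: transform each step, then average over time.
    summary = np.tanh(x @ w_summary).mean(axis=0)       # (H,)
    # Every time step receives the same single summary vector.
    tiled = np.broadcast_to(summary, local.shape)       # (T, H)
    # Combine the summary with the time-specific information (assumed: concat).
    combined = np.concatenate([local, tiled], axis=1)   # (T, 2H)
    return np.tanh(combined @ w_combine)                # (T, D_out)


# Hypothetical sizes and random weights, for demonstration only.
rng = np.random.default_rng(0)
T, D, H = 50, 8, 16
x = rng.standard_normal((T, D))
w_local = rng.standard_normal((D, H))
w_summary = rng.standard_normal((D, H))
w_combine = rng.standard_normal((2 * H, D))

out = summary_mixing(x, w_local, w_summary, w_combine)
print(out.shape)  # (50, 8)
```

Each step here touches every time step a constant number of times, so time and memory grow linearly with T, whereas self-attention builds a T × T matrix of pairwise scores. One consequence of the mean is that the summary does not depend on the order of the time steps.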

Code & Models

Repositories

samsunglabs/summarymixing (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Speech and Audio Processing