Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition

Titouan Parcollet; Rogier van Dalen; Shucong Zhang; Sourav Bhattacharya

arXiv:2409.07165 · cs.SD · September 12, 2024

TL;DR

This paper introduces a linear-time complexity Conformer Transducer with SummaryMixing for streaming speech recognition, outperforming traditional self-attention models in accuracy and efficiency in both streaming and offline modes.

Contribution

It extends SummaryMixing to a Conformer Transducer suitable for streaming speech recognition, achieving better performance with less compute and memory usage.

Findings

01

Outperforms self-attention in accuracy for streaming and offline speech recognition.

02

Requires less compute and memory during training and decoding.

03

Achieves linear time complexity in speech encoding.

Abstract

Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition that, for the first time, preserves or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and…
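To make the linear-time idea concrete, here is a minimal sketch in plain Python. It assumes the usual description of SummaryMixing: each frame gets a local transform, a single summary vector is obtained by averaging a second transform over time, and the two are combined per frame. The function names (`mean_summary`, `summary_mixing`, `causal_summary_mixing`) and the callables `f`, `s`, and `c` are hypothetical placeholders, not the API of the official repository. The streaming variant uses a running mean over past frames purely as an illustration of a summary that does not look ahead; the paper's actual streaming definition may differ.

```python
from typing import Callable, List

Vector = List[float]


def mean_summary(frames: List[Vector], s: Callable[[Vector], Vector]) -> Vector:
    """Average s(x_t) over all T frames in a single pass, so cost is O(T)."""
    if not frames:
        raise ValueError("need at least one frame")
    total = None
    for x in frames:
        sx = s(x)
        total = sx if total is None else [a + b for a, b in zip(total, sx)]
    return [a / len(frames) for a in total]


def summary_mixing(frames, f, s, c):
    """Offline sketch: every frame is combined with the same global summary."""
    summary = mean_summary(frames, s)
    return [c(f(x), summary) for x in frames]


def causal_summary_mixing(frames, f, s, c):
    """Streaming-style sketch (an assumption, not the paper's definition):
    frame t is combined with the mean of s(x) over frames 0..t only."""
    out, running = [], None
    for t, x in enumerate(frames, start=1):
        sx = s(x)
        running = sx if running is None else [a + b for a, b in zip(running, sx)]
        out.append(c(f(x), [a / t for a in running]))
    return out


# Toy stand-ins for the learned transforms: identity maps and concatenation.
identity = lambda v: list(v)
concat = lambda local, summary: local + summary

frames = [[1.0], [3.0], [5.0]]
offline = summary_mixing(frames, identity, identity, concat)
streaming = causal_summary_mixing(frames, identity, identity, concat)
```

With these toy inputs, `offline` pairs every frame with the global mean 3.0, while `streaming` pairs the frames with the running means 1.0, 2.0, and 3.0. In both cases the work grows linearly with the number of frames, in contrast to the pairwise comparisons that self-attention performs.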

Code & Models

Repositories

samsunglabs/summarymixing
PyTorch · Official

Taxonomy

Topics: Advanced Data Compression Techniques · Speech and Audio Processing · Speech Recognition and Synthesis