Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition
Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

TL;DR
This paper introduces a linear-time complexity Conformer Transducer with SummaryMixing for streaming speech recognition, outperforming traditional self-attention models in accuracy and efficiency in both streaming and offline modes.
Contribution
It extends SummaryMixing to a Conformer Transducer suitable for streaming speech recognition, achieving better performance with less compute and memory usage.
Findings
Outperforms self-attention in accuracy for streaming and offline speech recognition.
Requires less compute and memory during training and decoding.
Achieves linear time complexity in speech encoding.
Abstract
Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition that, for the first time, preserves or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and…
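To make the linear-time claim concrete, here is a minimal Python sketch. It assumes the summary is a plain mean over frame vectors, and it uses a running (causal) mean as a stand-in for a streaming-friendly summary. The abstract does not specify the streaming formulation, so the causal variant, the concatenation in `mix`, and all function names are illustrative assumptions, not the authors' method. The real model would use learned transformations where this sketch uses raw frames.

```python
from typing import List

Vector = List[float]


def mean_summary(frames: List[Vector]) -> Vector:
    """Offline summary: a single mean vector over all T frames, computed in O(T)."""
    dim = len(frames[0])
    total = [0.0] * dim
    for frame in frames:
        for i, value in enumerate(frame):
            total[i] += value
    return [value / len(frames) for value in total]


def causal_summaries(frames: List[Vector]) -> List[Vector]:
    """Assumed streaming variant: for each step t, the mean of frames 0..t.

    A running sum keeps the total cost at O(T) instead of O(T^2).
    """
    dim = len(frames[0])
    running = [0.0] * dim
    summaries = []
    for count, frame in enumerate(frames, start=1):
        for i, value in enumerate(frame):
            running[i] += value
        summaries.append([value / count for value in running])
    return summaries


def mix(frames: List[Vector], summaries: List[Vector]) -> List[Vector]:
    """Combine each frame with its summary by concatenation.

    Concatenation is a placeholder for a learned combiner network.
    """
    return [frame + summary for frame, summary in zip(frames, summaries)]


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
offline = mean_summary(frames)
streaming = causal_summaries(frames)
mixed = mix(frames, streaming)
```

Each frame is visited a constant number of times, so the cost grows linearly with the utterance length, whereas self-attention compares every frame with every other frame. In this sketch the last causal summary equals the offline summary, since both average over all frames.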
Taxonomy
Topics: Advanced Data Compression Techniques · Speech and Audio Processing · Speech Recognition and Synthesis
