Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

Ching-Feng Yeh; Yongqiang Wang; Yangyang Shi; Chunyang Wu; Frank Zhang; Julian Chan; Michael L. Seltzer

arXiv:2011.07120·cs.CL·November 17, 2020

TL;DR

This paper introduces a streaming speech recognition system for low-latency scenarios. It builds on the end-to-end neural transducer architecture, using attention-based modules augmented with convolution and augmented memory, and reports the lowest WER among streaming approaches on LibriSpeech.

Contribution

The paper presents a novel streaming end-to-end speech recognition model that uses augmented memory to reduce the computational cost and memory footprint of attention while maintaining high accuracy.

Findings

01 Achieves 2.7% WER on LibriSpeech test-clean

02 Achieves 5.8% WER on LibriSpeech test-other

03 Lowest reported WER among streaming approaches
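
For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the scoring tool used in the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split, with no
    text normalization of the kind real scoring tools usually apply.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a 2.7% WER corresponds to roughly 27 word errors per 1,000 reference words.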

Abstract

Attention-based models have been gaining popularity recently for their strong performance in fields such as machine translation and automatic speech recognition. One major challenge of attention-based models is the need for access to the full sequence and a computational cost that grows quadratically with the sequence length. These characteristics pose challenges, especially for low-latency scenarios, where the system is often required to be streaming. In this paper, we build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture with attention-based modules augmented with convolution. The proposed system equips the end-to-end models with the streaming capability and uses augmented memory to reduce the large footprint of the streaming attention-based model. On the LibriSpeech dataset, our proposed system achieves word…
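
To make the quadratic-cost point concrete, the sketch below counts query-key pairs for full-sequence attention and for a simplified segment-based scheme with a small memory bank. It is a rough illustration under stated assumptions, not the paper's formulation: the function names, the fixed segment length, the choice of one memory slot per past segment, and the cap on memory slots are all hypothetical, and left or right context frames are ignored.

```python
def full_attention_pairs(num_frames: int) -> int:
    """Every frame attends to every frame: cost grows as num_frames ** 2."""
    return num_frames * num_frames


def segmented_attention_pairs(num_frames: int, segment_len: int,
                              memory_slots: int) -> int:
    """Count pairs when each segment attends only to itself and to memory.

    Assumption for this sketch: each finished segment contributes one
    memory slot, and at most `memory_slots` of them are kept.
    """
    pairs = 0
    finished_segments = 0
    for start in range(0, num_frames, segment_len):
        frames_in_segment = min(segment_len, num_frames - start)
        visible_memory = min(finished_segments, memory_slots)
        # Each frame in the segment sees its own segment plus the memory.
        pairs += frames_in_segment * (frames_in_segment + visible_memory)
        finished_segments += 1
    return pairs


if __name__ == "__main__":
    frames = 1000
    print("full:", full_attention_pairs(frames))
    print("segmented:", segmented_attention_pairs(frames, 100, 4))
```

With a fixed segment length and a capped memory bank, the count grows roughly linearly with the number of frames, which is why segment-based processing suits streaming.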
