Cascaded encoders for unifying streaming and non-streaming ASR

Arun Narayanan; Tara N. Sainath; Ruoming Pang; Jiahui Yu; Chung-Cheng Chiu; Rohit Prabhavalkar; Ehsan Variani; Trevor Strohman

arXiv:2010.14606·eess.AS·October 29, 2020

TL;DR

This paper introduces a unified end-to-end speech recognition model with cascaded encoders that can operate in both streaming and non-streaming modes. It achieves word error rates (WER) similar to a standalone streaming model in streaming mode and a 10% to 27% relative improvement in non-streaming mode.

Contribution

The work presents a novel cascaded encoder architecture enabling a single E2E ASR model to function in both streaming and non-streaming modes simultaneously.

Findings

01. Achieves similar WER to standalone streaming models in streaming mode.

02. Obtains 10-27% relative WER improvement in non-streaming mode.

03. Outperforms existing two-pass E2E models, especially on long-form speech.

Abstract

End-to-end (E2E) automatic speech recognition (ASR) models have by now shown competitive performance on several benchmarks. These models are structured to operate in either streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both of these modes simultaneously. The proposed model consists of streaming and non-streaming encoders. Input features are first processed by the streaming encoder; the non-streaming encoder operates exclusively on the output of the streaming encoder. A single decoder then learns to decode using the output of either the streaming or the non-streaming encoder. Results show that this model achieves word error rates (WER) similar to those of a standalone streaming model when operating in streaming mode, and obtains a 10% to 27% relative improvement when operating in non-streaming mode. Our results also show…
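The data flow described in the abstract can be illustrated with a toy sketch. This is not the paper's model: the function names (`streaming_encoder`, `non_streaming_encoder`, `shared_decoder`, `recognize`) and the arithmetic inside them are hypothetical stand-ins chosen only to show the wiring. The first encoder is causal, the second encoder sees nothing but the first encoder's output (and may use the whole utterance), and one decoder serves both modes.

```python
from typing import List


def streaming_encoder(frames: List[float]) -> List[float]:
    """Toy causal encoder (assumption): output t depends only on frames[0..t].

    A running mean stands in for the real streaming layers.
    """
    out: List[float] = []
    total = 0.0
    for t, x in enumerate(frames, start=1):
        total += x
        out.append(total / t)
    return out


def non_streaming_encoder(stream_out: List[float]) -> List[float]:
    """Toy full-context encoder (assumption).

    Its only input is the streaming encoder's output, never the raw frames.
    Mixing in the utterance-level mean gives every step access to future context.
    """
    if not stream_out:
        return []
    utterance_mean = sum(stream_out) / len(stream_out)
    return [0.5 * (h + utterance_mean) for h in stream_out]


def shared_decoder(encoded: List[float]) -> List[int]:
    """Single decoder used in both modes; a threshold stands in for real decoding."""
    return [1 if h > 0.5 else 0 for h in encoded]


def recognize(frames: List[float], mode: str) -> List[int]:
    """Run the cascade in 'streaming' or 'non_streaming' mode."""
    h = streaming_encoder(frames)          # always runs first
    if mode == "non_streaming":
        h = non_streaming_encoder(h)       # cascaded on top of the streaming output
    elif mode != "streaming":
        raise ValueError(f"unknown mode: {mode!r}")
    return shared_decoder(h)               # same decoder for either encoder output


if __name__ == "__main__":
    frames = [1.0, 0.0, 1.0, 1.0]
    print(recognize(frames, "streaming"))
    print(recognize(frames, "non_streaming"))
```

Because the streaming encoder always runs first, the non-streaming path reuses its computation instead of re-encoding the input, and the two modes differ only in which encoder output is handed to the decoder.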
