Cascaded encoders for unifying streaming and non-streaming ASR
Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman

TL;DR
This paper introduces a unified end-to-end speech recognition model with cascaded encoders that can operate in both streaming and non-streaming modes. It matches the WER of a standalone streaming model in streaming mode and improves WER by 10% to 27% relative in non-streaming mode.
Contribution
The work presents a novel cascaded encoder architecture enabling a single E2E ASR model to function in both streaming and non-streaming modes simultaneously.
Findings
Achieves similar WER to standalone streaming models in streaming mode.
Obtains 10% to 27% relative WER improvement in non-streaming mode.
Outperforms existing two-pass E2E models, especially on long-form speech.
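As a reminder of how a relative WER improvement such as the 10% to 27% figure is computed, here is a minimal sketch. The function name and the example WER values are illustrative assumptions; they are not numbers reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are absolute WERs in percent (e.g. 10.0 means 10% WER).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values, chosen only to illustrate the arithmetic:
# a drop from 10.0% to 8.0% absolute WER is a 20% relative improvement.
example = relative_wer_reduction(10.0, 8.0)
```

A relative figure depends on the baseline: the same 2-point absolute drop is a 20% relative gain from 10.0% WER but only a 10% relative gain from 20.0% WER.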
Abstract
End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown competitive performance on several benchmarks. These models are structured to operate in either streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoders. Input features are first processed by the streaming encoder; the non-streaming encoder operates exclusively on the output of the streaming encoder. A single decoder then learns to decode using the output of either the streaming or the non-streaming encoder. Results show that this model achieves word error rates (WER) similar to those of a standalone streaming model when operating in streaming mode, and obtains 10% to 27% relative improvement when operating in non-streaming mode. Our results also show…
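The data flow described in the abstract can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the layer types, dimensions, context sizes, and every function and parameter name below are hypothetical stand-ins. It shows only the cascade itself: a causal encoder, a full-context encoder stacked on its output, and one decoder shared by both modes.

```python
import numpy as np


def causal_encoder(x, w, left_context=2):
    """Toy streaming encoder: frame t sees only frames t-left_context..t."""
    num_frames = x.shape[0]
    out = np.zeros((num_frames, w.shape[1]))
    for t in range(num_frames):
        start = max(0, t - left_context)
        out[t] = np.tanh(x[start:t + 1].mean(axis=0) @ w)
    return out


def full_context_encoder(h, w):
    """Toy non-streaming encoder: every frame sees the whole utterance."""
    context = h.mean(axis=0, keepdims=True)  # global summary, shape (1, D)
    return np.tanh((h + context) @ w)


def shared_decoder(enc, w_out):
    """Single decoder head, applied to whichever encoder output it is given."""
    return (enc @ w_out).argmax(axis=-1)  # one token id per frame


def cascaded_forward(x, params, mode):
    """Run the cascade in 'streaming' or 'non_streaming' mode."""
    h_stream = causal_encoder(x, params["w_stream"])
    if mode == "streaming":
        return shared_decoder(h_stream, params["w_out"])
    if mode == "non_streaming":
        # The second encoder consumes only the streaming encoder's output.
        h_full = full_context_encoder(h_stream, params["w_full"])
        return shared_decoder(h_full, params["w_out"])
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical sizes: 10 frames, 8 input features, 16 hidden units, 5 tokens.
rng = np.random.default_rng(0)
params = {
    "w_stream": rng.normal(size=(8, 16)),
    "w_full": rng.normal(size=(16, 16)),
    "w_out": rng.normal(size=(16, 5)),
}
features = rng.normal(size=(10, 8))
streaming_ids = cascaded_forward(features, params, "streaming")
non_streaming_ids = cascaded_forward(features, params, "non_streaming")
```

Because the non-streaming encoder reads the streaming encoder's output rather than the raw features, the streaming computation is reused in both modes, and the single decoder has to learn to accept either representation.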
