Streaming parallel transducer beam search with fast-slow cascaded encoders

Jay Mahadeokar; Yangyang Shi; Ke Li; Duc Le; Jiedan Zhu; Vikas Chandra; Ozlem Kalinli; Michael L Seltzer

arXiv:2203.15773·cs.CL·March 30, 2022

TL;DR

This paper introduces a parallel cascaded encoder approach with fast and slow streams for streaming ASR, utilizing a novel beam search to improve accuracy while maintaining low latency and computational efficiency.

Contribution

It proposes a new parallel time-synchronous beam search for transducers using fast-slow cascaded encoders with variable context sizes, enhancing accuracy and efficiency in streaming ASR.

Findings

01 Achieves up to 20% WER reduction on LibriSpeech.

02 Keeps latency low, at the cost of a slight increase in delay.

03 Reduces computation and memory footprint for edge deployment.

Abstract

Streaming ASR with strict latency constraints is required in many speech recognition applications. To achieve the required latency, streaming ASR models sacrifice accuracy compared to non-streaming ASR models because they lack future input context. Previous research has shown that streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders. This work improves upon the cascaded encoders framework by leveraging two streaming non-causal encoders with variable input context sizes that can produce outputs at different audio intervals (e.g., fast and slow). We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from fast-slow encoders, where the slow encoder corrects the mistakes generated by the fast encoder. The proposed algorithm achieves up to 20% WER reduction with a slight increase in…
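The fast-slow idea in the abstract can be illustrated with a toy scheduling sketch. This is not the paper's beam search: there is no transducer, no beam, and no neural encoder here. The function name `fast_slow_stream`, the chunk sizes, and the two stand-in decoders are all hypothetical, chosen only to show a fast pass emitting tentative tokens at short intervals and a slow pass overwriting them at longer intervals.

```python
from typing import Callable, List, Sequence, Tuple

# A decoder here is just a function from a chunk of frames to tokens
# (an assumption for illustration; real decoders carry state and beams).
Decoder = Callable[[Sequence[str]], List[str]]


def fast_slow_stream(
    frames: Sequence[str],
    fast_decode: Decoder,
    slow_decode: Decoder,
    fast_chunk: int = 2,
    slow_chunk: int = 6,
) -> List[Tuple[int, List[str]]]:
    """Return (frame_count, partial_hypothesis) pairs as frames arrive."""
    committed: List[str] = []  # tokens from the slow pass
    tentative: List[str] = []  # fast-pass tokens since the last slow update
    slow_pos = 0               # frames already consumed by the slow pass
    fast_pos = 0               # frames already consumed by the fast pass
    partials: List[Tuple[int, List[str]]] = []
    n = len(frames)
    for t in range(1, n + 1):
        if t - slow_pos >= slow_chunk or t == n:
            # Slow pass re-decodes everything since its last update and
            # replaces whatever the fast pass guessed for that span.
            committed += slow_decode(frames[slow_pos:t])
            tentative = []
            slow_pos = fast_pos = t
            partials.append((t, list(committed)))
        elif t - fast_pos >= fast_chunk:
            # Fast pass extends the partial result with low delay.
            tentative += fast_decode(frames[fast_pos:t])
            fast_pos = t
            partials.append((t, committed + tentative))
    return partials


# Stand-in decoders: the fast one mislabels frame "c", the slow one does not.
def toy_fast(chunk: Sequence[str]) -> List[str]:
    return ["?" if f == "c" else f.upper() for f in chunk]


def toy_slow(chunk: Sequence[str]) -> List[str]:
    return [f.upper() for f in chunk]


for step, hyp in fast_slow_stream(list("abcdefgh"), toy_fast, toy_slow):
    print(step, "".join(hyp))
```

In this toy run the partial result after four frames contains the fast pass's error, and the update after six frames replaces it with the slow pass's output. The design choice it illustrates is the trade-off named in the abstract: early partial results come from the low-context stream, and accuracy is recovered once the larger-context stream catches up.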

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Blind Source Separation Techniques