Streaming parallel transducer beam search with fast-slow cascaded encoders
Jay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas Chandra, Ozlem Kalinli, Michael L Seltzer

TL;DR
This paper introduces a parallel cascaded encoder approach with fast and slow streams for streaming ASR, utilizing a novel beam search to improve accuracy while maintaining low latency and computational efficiency.
Contribution
It proposes a new parallel time-synchronous beam search for transducers using fast-slow cascaded encoders with variable context sizes, enhancing accuracy and efficiency in streaming ASR.
Findings
Achieves up to 20% WER reduction on LibriSpeech.
Maintains low latency with slight delay increase.
Reduces computation and memory footprint for edge deployment.
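As a reminder of how a WER figure like the one above is usually computed, here is a minimal sketch. It assumes the "20%" is a relative reduction, which this summary does not state, and the helper names `word_error_rate` and `relative_reduction` are hypothetical, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (assumed relative)."""
    return (baseline_wer - new_wer) / baseline_wer
```

For instance, with made-up numbers, moving from a WER of 5.0 to 4.0 would be a 20% relative reduction under this definition.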
Abstract
Streaming ASR with strict latency constraints is required in many speech recognition applications. To achieve the required latency, streaming ASR models sacrifice accuracy compared to non-streaming ASR models due to lack of future input context. Previous research has shown that streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders. This work improves upon this cascaded encoders framework by leveraging two streaming non-causal encoders with variable input context sizes that can produce outputs at different audio intervals (e.g., fast and slow). We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from fast-slow encoders, where the slow encoder corrects the mistakes generated from the fast encoder. The proposed algorithm achieves up to 20% WER reduction with a slight increase in…
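The emit-then-correct timing described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: there is no transducer and no beam search here, and the function `fast_slow_decode`, its parameters, and the stand-in decoders are all hypothetical. It only shows a fast stream emitting a tentative token per chunk while a slow stream, which sees a longer segment, periodically replaces those tentative tokens.

```python
from typing import Callable, Iterator, List, Sequence


def fast_slow_decode(
    chunks: Sequence[str],
    fast_decode: Callable[[str], List[str]],
    slow_decode: Callable[[Sequence[str]], List[str]],
    slow_every: int,
) -> Iterator[List[str]]:
    """Yield a partial transcript after each chunk (toy illustration only)."""
    committed: List[str] = []  # tokens confirmed by the slow stream
    tentative: List[str] = []  # fast-stream tokens not yet confirmed
    pending: List[str] = []    # chunks the slow stream has not consumed yet
    for chunk in chunks:
        pending.append(chunk)
        tentative.extend(fast_decode(chunk))
        if len(pending) == slow_every:
            # The slow stream re-decodes the whole segment and overrides
            # whatever the fast stream emitted for it.
            committed.extend(slow_decode(pending))
            pending, tentative = [], []
        yield committed + tentative
    if pending:
        # Flush a trailing segment shorter than slow_every.
        committed.extend(slow_decode(pending))
        yield committed


# Stand-in decoders (assumptions): the fast one confuses a homophone,
# the slow one is treated as always correct.
FAST_ERRORS = {"their": "there"}


def toy_fast(chunk: str) -> List[str]:
    return [FAST_ERRORS.get(chunk, chunk)]


def toy_slow(segment: Sequence[str]) -> List[str]:
    return list(segment)


outputs = list(fast_slow_decode(["their", "car", "is", "red"],
                                toy_fast, toy_slow, slow_every=2))
for step, partial in enumerate(outputs, start=1):
    print(step, " ".join(partial))
```

In this toy run the first partial transcript contains the fast stream's mistake, and the second one replaces it once the slow stream has consumed two chunks. A larger `slow_every` gives the slow stream more context but delays the correction, which mirrors the accuracy versus delay trade-off the abstract mentions.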
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Blind Source Separation Techniques
