Improving Fast-slow Encoder based Transducer with Streaming Deliberation
Ke Li, Jay Mahadeokar, Jinxi Guo, Yangyang Shi, Gil Keren, Ozlem Kalinli, Michael L. Seltzer, Duc Le

TL;DR
This paper enhances fast-slow encoder transducers for speech recognition by integrating a streaming deliberation model, achieving improved accuracy with minimal latency increase through efficient algorithms and data augmentation.
Contribution
It introduces a streaming deliberation model for fast-slow encoder transducers, improving recognition accuracy while maintaining low latency and efficiency.
Findings
Achieved 3-5% relative WER reduction on Librispeech and in-house data.
Enhanced the transducer with a streaming deliberation model for error correction.
Maintained low token emission latency despite accuracy improvements.
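For readers unfamiliar with the metric, a relative WER reduction compares the change in word error rate against the baseline's WER rather than reporting the absolute difference. The sketch below shows the arithmetic only; the function name and the example WER values are hypothetical placeholders and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer over baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical WERs (in percent) chosen for illustration only.
    baseline = 5.00
    improved = 4.80
    werr = relative_wer_reduction(baseline, improved)
    print(f"{werr:.1f}% relative WER reduction")
```

Under these assumed values, an absolute drop of 0.2 WER points on a 5.0% baseline corresponds to a 4% relative reduction, which falls inside the 3-5% range quoted above.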
Abstract
This paper introduces a fast-slow encoder based transducer with streaming deliberation for end-to-end automatic speech recognition. We aim to improve the recognition accuracy of the fast-slow encoder based transducer while keeping its latency low by integrating a streaming deliberation model. Specifically, the deliberation model leverages partial hypotheses from the streaming fast encoder and implicitly learns to correct recognition errors. We modify the parallel beam search algorithm for fast-slow encoder based transducer to be efficient and compatible with the deliberation model. In addition, the deliberation model is designed to process streaming data. To further improve the deliberation performance, a simple text augmentation approach is explored. We also compare LSTM and Conformer models for encoding partial hypotheses. Experiments on Librispeech and in-house data show relative WER…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Ultrasonics and Acoustic Wave Propagation
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
