Improving Fast-slow Encoder based Transducer with Streaming Deliberation

Ke Li; Jay Mahadeokar; Jinxi Guo; Yangyang Shi; Gil Keren; Ozlem Kalinli; Michael L. Seltzer; Duc Le

arXiv:2212.07650 · eess.AS · December 16, 2022 · ICASSP

Open Access

TL;DR

This paper enhances fast-slow encoder transducers for speech recognition by integrating a streaming deliberation model, achieving improved accuracy with minimal latency increase through efficient algorithms and data augmentation.

Contribution

It introduces a streaming deliberation model for fast-slow encoder transducers, improving recognition accuracy while maintaining low latency and efficiency.

Findings

01. Achieved 3-5% relative WER reduction on Librispeech and in-house data.

02. Enhanced the transducer with a streaming deliberation model for error correction.

03. Maintained low token emission latency despite accuracy improvements.
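The relative WER reduction in the first finding is conventionally computed as the drop in word error rate divided by the baseline word error rate. The sketch below illustrates that arithmetic only. The function name and the two WER values are hypothetical, chosen for illustration, and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer over baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (in percent) for illustration only; not from the paper.
baseline = 4.0   # assumed WER of a baseline system
improved = 3.84  # assumed WER after adding a second-pass model

werr = relative_wer_reduction(baseline, improved)
print(f"{werr:.1f}% relative WER reduction")
```

A relative reduction is different from an absolute one: in this made-up case the absolute WER drops by only 0.16 points, while the relative reduction is about 4%.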

Abstract

This paper introduces a fast-slow encoder based transducer with streaming deliberation for end-to-end automatic speech recognition. We aim to improve the recognition accuracy of the fast-slow encoder based transducer while keeping its latency low by integrating a streaming deliberation model. Specifically, the deliberation model leverages partial hypotheses from the streaming fast encoder and implicitly learns to correct recognition errors. We modify the parallel beam search algorithm for fast-slow encoder based transducer to be efficient and compatible with the deliberation model. In addition, the deliberation model is designed to process streaming data. To further improve the deliberation performance, a simple text augmentation approach is explored. We also compare LSTM and Conformer models for encoding partial hypotheses. Experiments on Librispeech and in-house data show relative WER…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Ultrasonics and Acoustic Wave Propagation

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory