Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition
Anshuman Tripathi, Jaeyoung Kim, Qian Zhang, Han Lu, Hasim Sak

TL;DR
This paper introduces a Transformer-Transducer model that unifies streaming and non-streaming speech recognition, allowing flexible latency-accuracy trade-offs and achieving significant accuracy improvements with minimal additional latency.
Contribution
The paper proposes a novel Transformer-Transducer architecture with variable right context layers, enabling a single model to perform both streaming and non-streaming speech recognition.
Findings
Achieves a 20% relative accuracy improvement on a voice-search task when the top layers run in the delayed, high-latency mode.
Allows dynamic adjustment of context length for latency-accuracy trade-offs.
Optimizations enable faster inference in both streaming and non-streaming modes.
Abstract
In this paper we present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model. The model is composed of a stack of transformer layers for audio encoding with no lookahead or right context and an additional stack of transformer layers on top trained with variable right context. At inference time, the context length for the variable context layers can be changed to trade off the latency and the accuracy of the model. We also show that we can run this model in a Y-model architecture with the top layers running in parallel in low latency and high latency modes. This allows us to have streaming speech recognition results with limited latency and delayed speech recognition results with large improvements in accuracy (20% relative improvement for voice-search task). We show that with limited right…
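The right-context idea in the abstract can be illustrated with a self-attention mask. The sketch below is not the authors' code: the function names, the left-context parameter, and all frame counts are assumptions chosen for illustration. It shows how the same layer could be run with zero right context (streaming) or a small right context (delayed), and how per-layer right contexts add up to the total lookahead of a stack.

```python
def context_mask(num_frames, left_context, right_context):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Hypothetical helper: frame i sees frames in
    [i - left_context, i + right_context].
    """
    return [
        [i - left_context <= j <= i + right_context for j in range(num_frames)]
        for i in range(num_frames)
    ]


def total_lookahead(right_contexts):
    """Future frames a stack of layers needs before emitting an output.

    Each layer's right context extends the receptive field of the layer
    below it, so the per-layer values add up.
    """
    return sum(right_contexts)


# Streaming mode: no lookahead at all (illustrative sizes).
streaming = context_mask(5, left_context=2, right_context=0)

# Delayed mode: the same layer, allowed one future frame.
delayed = context_mask(5, left_context=2, right_context=1)

# Assumed 4-layer stack: two bottom layers with no right context,
# two top layers with a right context of 2 frames each.
lookahead = total_lookahead([0, 0, 2, 2])
```

Because only the mask changes between the two modes, the weights can be shared, which is what lets one model trade latency against accuracy at inference time.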
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
