Globally Normalising the Transducer for Streaming Speech Recognition
Rogier van Dalen

TL;DR
This paper shows how to apply global normalisation to a streaming Transducer speech recogniser by approximating the loss function. This lets the model revise its earlier predictions and reduces word error rate by 9-11% relative.
Contribution
It proposes approximating the loss function, rather than the model, so that global normalisation can be applied to a state-of-the-art streaming model even though the exact loss cannot be evaluated.
Findings
Reduces word error rate by 9-11% relative
Closes nearly half the gap between streaming and lookahead modes
Lets the streaming model revise earlier predictions, which local normalisation restricts
Abstract
The Transducer (e.g. RNN-Transducer or Conformer-Transducer) generates an output label sequence as it traverses the input sequence. It is straightforward to use in streaming mode, where it generates partial hypotheses before the complete input has been seen. This makes it popular in speech recognition. However, in streaming mode the Transducer has a mathematical flaw which, simply put, restricts the model's ability to change its mind. The fix is to replace local normalisation (e.g. a softmax) with global normalisation, but then the loss function becomes impossible to evaluate exactly. A recent paper proposes to solve this by approximating the model, severely degrading performance. Instead, this paper proposes to approximate the loss function, allowing global normalisation to apply to a state-of-the-art streaming model. Global normalisation reduces its word error rate by 9-11% relative,…
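The difference between local and global normalisation described in the abstract can be illustrated with a toy sketch. This is not the paper's model or its loss approximation: the two-label alphabet, the two-step sequences, and the `score` function are all hypothetical, chosen so that evidence arriving at the second step favours one first label. Under local normalisation (a softmax per step) the probability mass assigned to the first label is fixed once that step is taken; under global normalisation (one normaliser over whole sequences) later scores can shift it.

```python
import itertools
import math

LABELS = ("a", "b")  # hypothetical two-label alphabet


def score(step, prefix, label):
    """Hypothetical unnormalised score for emitting `label` after `prefix`.

    Step 0 is undecided between the labels. At step 1, later evidence
    strongly supports every path whose first label was "a".
    """
    if step == 0:
        return 0.0
    return 5.0 if prefix[0] == "a" else 0.0


def local_prob(seq):
    """Locally normalised: a softmax over labels at every step."""
    prob = 1.0
    for step, label in enumerate(seq):
        prefix = seq[:step]
        z = sum(math.exp(score(step, prefix, l)) for l in LABELS)
        prob *= math.exp(score(step, prefix, label)) / z
    return prob


def path_score(seq):
    """Sum of unnormalised scores along one complete sequence."""
    return sum(score(step, seq[:step], label) for step, label in enumerate(seq))


def global_prob(seq):
    """Globally normalised: one normaliser over all sequences of this length.

    The brute-force sum below enumerates every sequence, so its cost grows
    exponentially with length; it is only feasible for a toy like this.
    """
    all_seqs = itertools.product(LABELS, repeat=len(seq))
    z = sum(math.exp(path_score(s)) for s in all_seqs)
    return math.exp(path_score(seq)) / z


def prefix_mass(prob_fn, first, length):
    """Total probability of all sequences whose first label is `first`."""
    return sum(
        prob_fn(s)
        for s in itertools.product(LABELS, repeat=length)
        if s[0] == first
    )


local_mass = prefix_mass(local_prob, "a", 2)
global_mass = prefix_mass(global_prob, "a", 2)
print(f"local  P(first label = a) = {local_mass:.4f}")
print(f"global P(first label = a) = {global_mass:.4f}")
```

In this toy the locally normalised model leaves the first label at probability 0.5 no matter what the second step scores, whereas the globally normalised model moves most of the mass to "a". The exhaustive normaliser is used here only for clarity and does not scale to realistic sequence lengths.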
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
