Globally Normalising the Transducer for Streaming Speech Recognition

Rogier van Dalen

arXiv:2307.10975 · eess.AS · July 21, 2023

Open Access

TL;DR

This paper applies global normalisation to a streaming Transducer speech recogniser by approximating the loss function. This relaxes the restriction on the model's ability to revise earlier predictions and reduces word error rate by 9-11% relative.

Contribution

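Global normalisation makes the Transducer's loss function impossible to evaluate exactly. The paper proposes approximating the loss function, rather than approximating the model as a recent paper does, so that global normalisation can be applied to a state-of-the-art streaming model.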

Findings

01 Reduces word error rate by 9-11% relative

02 Closes nearly half the gap between streaming and lookahead modes

03 Relaxes the restriction on the streaming model's ability to change its mind

Abstract

The Transducer (e.g. RNN-Transducer or Conformer-Transducer) generates an output label sequence as it traverses the input sequence. It is straightforward to use in streaming mode, where it generates partial hypotheses before the complete input has been seen. This makes it popular in speech recognition. However, in streaming mode the Transducer has a mathematical flaw which, simply put, restricts the model's ability to change its mind. The fix is to replace local normalisation (e.g. a softmax) with global normalisation, but then the loss function becomes impossible to evaluate exactly. A recent paper proposes to solve this by approximating the model, severely degrading performance. Instead, this paper proposes to approximate the loss function, allowing global normalisation to apply to a state-of-the-art streaming model. Global normalisation reduces its word error rate by 9-11% relative,…
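
The difference between local and global normalisation that the abstract refers to can be illustrated with a toy example. The sketch below is not the paper's model or its loss approximation: the two-label vocabulary, the `score` table, and all function names are hypothetical, chosen only to show the effect. With local normalisation, each step is passed through its own softmax, so the probability of the first label is fixed before later scores are seen. With global normalisation, whole sequences are scored and normalised once, so later scores can shift probability between first labels.

```python
import itertools
import math

LABELS = ("a", "b")

# Hypothetical unnormalised scores for emitting a label after a prefix.
# After "a" both continuations score poorly; after "b" one scores very well.
SCORES = {
    ((), "a"): 1.0, ((), "b"): 0.0,
    (("a",), "a"): -2.0, (("a",), "b"): -2.0,
    (("b",), "a"): 3.0, (("b",), "b"): 0.0,
}


def score(prefix, label):
    return SCORES[(tuple(prefix), label)]


def sequences(length=2):
    return list(itertools.product(LABELS, repeat=length))


def local_prob(seq):
    """Product of per-step softmaxes (local normalisation)."""
    p = 1.0
    for i, label in enumerate(seq):
        prefix = seq[:i]
        z = sum(math.exp(score(prefix, other)) for other in LABELS)
        p *= math.exp(score(prefix, label)) / z
    return p


def total_score(seq):
    return sum(score(seq[:i], label) for i, label in enumerate(seq))


def global_prob(seq):
    """One normalisation over all sequences (global normalisation).

    Brute-force enumeration is only feasible because the toy is tiny.
    """
    z = sum(math.exp(total_score(s)) for s in sequences(len(seq)))
    return math.exp(total_score(seq)) / z


def first_label_marginal(prob, first):
    """Total probability of all sequences starting with `first`."""
    return sum(prob(s) for s in sequences() if s[0] == first)


if __name__ == "__main__":
    for name, prob in (("local", local_prob), ("global", global_prob)):
        print(name, first_label_marginal(prob, "a"))
```

In this toy, the locally normalised model keeps about 0.73 of the probability mass on sequences starting with "a", whatever the second-step scores are. The globally normalised model moves almost all of the mass to sequences starting with "b", because the strong second-step score after "b" is allowed to count. The enumeration over all sequences also hints at why exact evaluation becomes a problem at realistic sequence lengths.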

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing