Global Normalization for Streaming Speech Recognition in a Modular Framework

Ehsan Variani; Ke Wu; Michael Riley; David Rybach; Matt Shannon; Cyril Allauzen

arXiv:2205.13674·cs.LG·May 30, 2022·6 cites

TL;DR

This paper presents GNAT, a globally normalized autoregressive transducer for streaming speech recognition. Developed within a flexible, modular framework, it reduces the word error rate gap between streaming and non-streaming models by more than 50% on Librispeech.

Contribution

The paper introduces GNAT, a globally normalized model that addresses the label bias problem in streaming speech recognition. The denominator of its sequence-level normalization can be computed exactly and tractably, and the model is built within a modular framework.

Findings

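01 Reduces the WER gap between streaming and non-streaming models by more than 50% on Librispeech

02 Enables controlled comparison of modelling choices across speech recognition models

03 Provides a modular framework that encompasses the common neural speech recognition models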

Abstract

We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming and non-streaming speech-recognition models can be greatly reduced (by more than 50% on the Librispeech dataset). This model is developed in a modular framework which encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and creation of new models.
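To make the idea of an exactly computable sequence-level denominator concrete, here is a minimal sketch in plain Python. It is not the GNAT lattice or the API of the google-research/last library. It is a toy model that assumes one label per frame and a context limited to the previous label, and every name in it (`log_partition`, `sequence_score`, `global_log_prob`, `weights`) is hypothetical. Under those assumptions, the sum over all label sequences collapses into a forward recursion over context states, which the sketch checks against brute-force enumeration.

```python
import itertools
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(weights, labels, start=0):
    """Unnormalized score of one label sequence (sum of per-frame weights)."""
    prev, total = start, 0.0
    for frame, cur in zip(weights, labels):
        total += frame[prev][cur]
        prev = cur
    return total


def log_partition(weights, start=0):
    """Log of the sequence-level denominator via a forward recursion.

    weights[t][prev][cur] is an unnormalized score for label `cur` at
    frame t given previous label `prev` (toy assumption: the context
    state is just the previous label).
    """
    num_labels = len(weights[0])
    # alpha[s]: log-sum of scores of all prefixes that end in state s.
    alpha = [0.0 if s == start else -math.inf for s in range(num_labels)]
    for frame in weights:
        alpha = [
            logsumexp([alpha[prev] + frame[prev][cur] for prev in range(num_labels)])
            for cur in range(num_labels)
        ]
    return logsumexp(alpha)


def global_log_prob(weights, labels, start=0):
    """Globally normalized log-probability: score minus log denominator."""
    return sequence_score(weights, labels, start) - log_partition(weights, start)


# Toy check: 4 frames, 3 labels, arbitrary (unnormalized) weights.
rng = random.Random(0)
T, V = 4, 3
weights = [[[rng.uniform(-2, 2) for _ in range(V)] for _ in range(V)] for _ in range(T)]

all_sequences = list(itertools.product(range(V), repeat=T))
brute = logsumexp([sequence_score(weights, y) for y in all_sequences])
total_prob = sum(math.exp(global_log_prob(weights, y)) for y in all_sequences)
```

The recursion costs on the order of T·V² operations, while the brute-force sum enumerates V^T sequences. Both yield the same denominator here, and the globally normalized probabilities sum to one even though no per-frame softmax is applied.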

Code & Models

Repositories

google-research/last
jax

Videos

Global Normalization for Streaming Speech Recognition in a Modular Framework · slideslive

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing