An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

Niko Moritz; Frank Seide; Duc Le; Jay Mahadeokar; Christian Fuegen

arXiv:2204.08858 · eess.AS · October 25, 2022

Open Access

TL;DR

This paper explores monotonic transducers for large-scale automatic speech recognition and shows that, with regularized training, they can match or exceed the accuracy of RNN-T models.

Contribution

It demonstrates that regularizing the training of monotonic transducers such as MonoRNN-T and CTC-T, via joint LAS training or parameter initialization from RNN-T, raises their accuracy to match or surpass that of RNN-T, especially on large datasets.

Findings

01

Monotonic transducers can match or outperform RNN-T when their training is regularized.

02

Regularizing training, via joint LAS training or parameter initialization from RNN-T, improves monotonic transducer accuracy.

03

Monotonic transducers consume exactly one model score per time step, which makes them more compatible with traditional FST-based decoders.

Abstract

The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types we can classify the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T). Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination, where a model keeps emitting non-blank symbols without advancing in time. Second, monotonic transducers consume exactly one model score per time step and are therefore more compatible with traditional FST-based ASR decoders. However, the MonoRNN-T so far has been found to have worse accuracy than RNN-T. It does not have to be that way: by regularizing the training via joint LAS training or parameter initialization from RNN-T, both MonoRNN-T and CTC-T perform as well as or better than RNN-T.…
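
The decoding difference described in the abstract can be illustrated with a minimal greedy-search sketch. This is not the authors' implementation: the `Scorer` callable, the `BLANK` index, the `max_symbols` cap, and the toy `stuck_scorer` are all hypothetical names introduced here for illustration. The sketch only shows the control flow: an RNN-T style loop stays on the same frame after a non-blank label, while a monotonic loop reads one score vector per frame and always advances.

```python
from typing import Callable, List, Sequence

BLANK = 0  # assumed index of the blank symbol (illustrative choice)

# Hypothetical scorer: (frame index, label history) -> one score per label.
Scorer = Callable[[int, Sequence[int]], Sequence[float]]


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda k: scores[k])


def greedy_rnnt(score: Scorer, num_frames: int, max_symbols: int = 10) -> List[int]:
    """RNN-T style: a non-blank label keeps the decoder on the same frame."""
    hyp: List[int] = []
    for t in range(num_frames):
        # The cap is a common safeguard against runaway emission (assumed here).
        for _ in range(max_symbols):
            label = argmax(score(t, hyp))
            if label == BLANK:
                break  # only blank advances to the next frame
            hyp.append(label)
    return hyp


def greedy_monotonic(score: Scorer, num_frames: int) -> List[int]:
    """Monotonic style: one score lookup per frame, then time always advances."""
    hyp: List[int] = []
    for t in range(num_frames):
        label = argmax(score(t, hyp))
        if label != BLANK:
            hyp.append(label)
    return hyp


def stuck_scorer(t: int, hyp: Sequence[int]) -> Sequence[float]:
    """Toy pathological model that always prefers label 1 over blank."""
    return [0.0, 1.0]


print(len(greedy_rnnt(stuck_scorer, 3, max_symbols=4)))  # 12
print(len(greedy_monotonic(stuck_scorer, 3)))            # 3
```

With the pathological scorer, the RNN-T style loop emits labels until the cap stops it on every frame, whereas the monotonic loop can emit at most one label per frame. The one-lookup-per-frame structure is also what makes the monotonic variant easier to plug into a frame-synchronous decoder.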

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing