Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition

Hyeonseung Lee; Ji Won Yoon; Sungsoo Kim; Nam Soo Kim

arXiv:2411.17537·eess.AS·November 27, 2024

TL;DR

This paper introduces FoCCE, a likelihood estimator for transducer-based streaming speech recognition that addresses the mismatch between training and inference and improves recognition accuracy.

Contribution

It proposes a mathematical quantification of the gap between the actual likelihood and the deformed likelihood in streaming transducer models, called forward variable causal compensation (FoCC), together with an estimator (FoCCE) for estimating the exact likelihood.

Findings

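01. FoCCE training improves the accuracy of streaming transducers on LibriSpeech.

02. FoCC quantifies the likelihood mismatch that arises when streaming transducers are trained with non-streaming recursion rules.

03. Estimating the exact likelihood, rather than the deformed one, yields better model performance.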

Abstract

Transducer neural networks have emerged as the mainstream approach for streaming automatic speech recognition (ASR), offering state-of-the-art performance in balancing accuracy and latency. In the conventional framework, streaming transducer models are trained to maximize the likelihood function based on non-streaming recursion rules. However, this approach leads to a mismatch between training and inference, resulting in the issue of deformed likelihood and consequently suboptimal ASR accuracy. We introduce a mathematical quantification of the gap between the actual likelihood and the deformed likelihood, namely forward variable causal compensation (FoCC). We also present its estimator, FoCCE, as a solution to estimate the exact likelihood. Through experiments on the LibriSpeech dataset, we show that FoCCE training improves the accuracy of the streaming transducers.
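For readers unfamiliar with the "non-streaming recursion rules" mentioned above, the following is a minimal sketch of the standard transducer forward recursion in log space. It is an assumption that this is the recursion the abstract refers to; it is not the paper's FoCC or FoCCE, whose definitions the abstract does not give. The function name, argument names, and array layout are hypothetical and chosen for illustration only.

```python
import numpy as np

def transducer_log_likelihood(log_probs, labels, blank=0):
    """Standard transducer forward algorithm (illustrative sketch, not FoCCE).

    log_probs: array of shape (T, U + 1, V) holding log P(token | frame t, u labels emitted).
    labels:    target label sequence of length U (integer token ids).
    blank:     index of the blank token (assumed to be 0 here).
    """
    T, U1, _ = log_probs.shape
    assert U1 == len(labels) + 1, "second axis must have length U + 1"

    # alpha[t, u] = log probability of having emitted labels[:u] on reaching frame t.
    alpha = np.full((T, U1), -np.inf)
    alpha[0, 0] = 0.0

    for t in range(T):
        for u in range(U1):
            if t == 0 and u == 0:
                continue
            # Arrive from (t - 1, u) by emitting blank, which advances one frame.
            stay = alpha[t - 1, u] + log_probs[t - 1, u, blank] if t > 0 else -np.inf
            # Arrive from (t, u - 1) by emitting the next label, staying on frame t.
            emit = alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]] if u > 0 else -np.inf
            alpha[t, u] = np.logaddexp(stay, emit)

    # A final blank at the last lattice node terminates the alignment.
    return alpha[T - 1, U1 - 1] + log_probs[T - 1, U1 - 1, blank]
```

As a sanity check under these assumptions: with a uniform distribution over V = 4 tokens, T = 3 frames, and U = 2 labels, there are 6 alignments of 5 emissions each, so the function should return log(6 / 4^5).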
