Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition
Hyeonseung Lee, Ji Won Yoon, Sungsoo Kim, Nam Soo Kim

TL;DR
This paper introduces FoCCE, a likelihood estimator for transducer-based streaming speech recognition that addresses the mismatch between training and inference and improves recognition accuracy.
Contribution
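The paper defines forward variable causal compensation (FoCC), a mathematical quantification of the gap between the actual likelihood and the deformed likelihood in streaming transducer models, and proposes FoCCE, an estimator of FoCC that is used to estimate the exact likelihood during training.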
Findings
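FoCCE training improves the accuracy of streaming transducers on LibriSpeech
Training streaming transducers with non-streaming recursion rules causes a training-inference mismatch and a deformed likelihood
FoCC quantifies the gap between the actual and deformed likelihood, and FoCCE estimates it so that training targets the exact likelihood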
Abstract
Transducer neural networks have emerged as the mainstream approach for streaming automatic speech recognition (ASR), offering state-of-the-art performance in balancing accuracy and latency. In the conventional framework, streaming transducer models are trained to maximize the likelihood function based on non-streaming recursion rules. However, this approach leads to a mismatch between training and inference, resulting in the issue of deformed likelihood and consequently suboptimal ASR accuracy. We introduce a mathematical quantification of the gap between the actual likelihood and the deformed likelihood, namely forward variable causal compensation (FoCC). We also present its estimator, FoCCE, as a solution to estimate the exact likelihood. Through experiments on the LibriSpeech dataset, we show that FoCCE training improves the accuracy of the streaming transducers.
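For readers unfamiliar with the recursion the abstract refers to, the following is a minimal sketch of the conventional transducer forward-variable recursion in log space. It is not the FoCC or FoCCE computation, whose form is not given on this page. The function name, argument names, and array layout are assumptions made for illustration, and the per-node log-probabilities are assumed to be precomputed by the model.

```python
import numpy as np

def transducer_log_likelihood(log_blank, log_label):
    """Conventional transducer forward recursion (illustrative sketch).

    log_blank[t, u]: log-prob of blank at frame t after u labels, shape (T, U + 1)
    log_label[t, u]: log-prob of the (u + 1)-th target label at frame t
                     after u labels, shape (T, U)
    Both arrays are hypothetical inputs assumed to come from the joint network.
    """
    T, U1 = log_blank.shape
    alpha = np.full((T, U1), -np.inf)  # log forward variables
    alpha[0, 0] = 0.0                  # log(1): empty prefix at the first frame
    for t in range(T):
        for u in range(U1):
            if t == 0 and u == 0:
                continue
            # Arrive by emitting blank at the previous frame (advance in time).
            stay = alpha[t - 1, u] + log_blank[t - 1, u] if t > 0 else -np.inf
            # Arrive by emitting the next label at the same frame.
            emit = alpha[t, u - 1] + log_label[t, u - 1] if u > 0 else -np.inf
            alpha[t, u] = np.logaddexp(stay, emit)
    # A final blank at the last node terminates every alignment.
    return alpha[T - 1, U1 - 1] + log_blank[T - 1, U1 - 1]

# Toy check: T = 2 frames, U = 1 label, every probability 0.5.
# There are two alignments, each with probability 0.5 ** 3 = 0.125.
toy_blank = np.log(np.full((2, 2), 0.5))
toy_label = np.log(np.full((2, 1), 0.5))
toy_ll = transducer_log_likelihood(toy_blank, toy_label)  # log(0.25)
```

In a streaming model the forward variable at frame t is computed from limited future context, which is the setting in which the abstract says this recursion yields a deformed likelihood.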
