Simplified End-to-End MMI Training and Voting for ASR
Lior Fritz, David Burshtein

TL;DR
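This paper introduces a simplified end-to-end MMI training approach for ASR that compares favorably to CTC in performance, robustness, decoding time, disk footprint, and alignment quality. Its alignments make a simple ensemble, built by averaging the predictions of separately trained models, effective at lowering the word error rate.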
Contribution
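It presents a simplified MMI training method that is trained end-to-end with gradient descent, uses a simple language model during training, and decodes with a standard HMM decoder. The method compares favorably to CTC and supports a straightforward ensemble technique based on averaging model predictions.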
Findings
Compares favorably to CTC in performance, robustness, and alignment quality
Averaging the predictions of separately trained models considerably reduces WER
Decoding time and disk footprint compare favorably to CTC
Abstract
A simplified speech recognition system that uses the maximum mutual information (MMI) criterion is considered. End-to-end training using gradient descent is suggested, similar to the training of connectionist temporal classification (CTC). We use an MMI criterion with a simple language model in the training stage, and a standard HMM decoder. Our method compares favorably to CTC in terms of performance, robustness, decoding time, disk footprint, and quality of alignments. The good alignments enable the use of a straightforward ensemble method, obtained by simply averaging the predictions of several neural network models that were trained separately end-to-end. The ensemble method yields a considerable reduction in the word error rate.
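The ensemble step described in the abstract can be illustrated with a minimal sketch. It assumes each separately trained model emits a frame-by-state matrix of posterior probabilities for the same utterance, and that the ensemble takes the arithmetic mean of those matrices before decoding. The abstract does not say whether averaging is done on probabilities or log-probabilities, so that choice, the function name `average_posteriors`, and the array layout are assumptions made for illustration only.

```python
import numpy as np


def average_posteriors(posteriors):
    """Average per-frame predictions from several models (hypothetical helper).

    posteriors: list of arrays, each of shape (num_frames, num_states),
                where every row is a probability distribution.
    Returns an array of the same shape holding the frame-wise mean.
    """
    # Stack into shape (num_models, num_frames, num_states).
    # np.stack raises ValueError if the models disagree on shape.
    stacked = np.stack(posteriors, axis=0)
    # Arithmetic mean over the model axis (an assumption, see lead-in).
    return stacked.mean(axis=0)


# Two toy "models" scoring 2 frames over 2 states.
model_a = np.array([[0.8, 0.2], [0.4, 0.6]])
model_b = np.array([[0.6, 0.4], [0.2, 0.8]])
ensemble = average_posteriors([model_a, model_b])
```

Frame-wise averaging of this kind only makes sense when the models agree on which frame corresponds to which unit, which is why the abstract ties the ensemble to the quality of the alignments. The averaged matrix would then be handed to the decoder in place of a single model's output.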
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
