Simplified End-to-End MMI Training and Voting for ASR

Lior Fritz; David Burshtein

arXiv:1703.10356·cs.LG·July 18, 2017·6 cites

Open Access

TL;DR

This paper introduces a simplified end-to-end MMI training approach for ASR that compares favorably to CTC in performance, robustness, and decoding efficiency, and whose alignments make simple ensemble averaging effective.

Contribution

It presents a simplified MMI training method, trained end-to-end with gradient descent, that compares favorably to CTC in performance, robustness, decoding time, disk footprint, and alignment quality, and that makes a straightforward ensemble method possible.

Findings

01 Outperforms CTC in accuracy and robustness

02 Enables effective ensemble averaging for lower WER

03 Reduces decoding time and disk footprint

Abstract

A simplified speech recognition system that uses the maximum mutual information (MMI) criterion is considered. End-to-end training using gradient descent is suggested, similarly to the training of connectionist temporal classification (CTC). We use an MMI criterion with a simple language model in the training stage, and a standard HMM decoder. Our method compares favorably to CTC in terms of performance, robustness, decoding time, disk footprint and quality of alignments. The good alignments enable the use of a straightforward ensemble method, obtained by simply averaging the predictions of several neural network models that were trained separately end-to-end. The ensemble method yields a considerable reduction in the word error rate.
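The ensemble step described in the abstract can be illustrated with a minimal sketch. It assumes each separately trained network emits a per-frame posterior matrix of shape (frames, labels) for the same utterance, and that "averaging the predictions" means an arithmetic mean over models; both are assumptions, since the abstract does not specify the output layout or the kind of average. The function name `average_posteriors` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def average_posteriors(posteriors):
    """Average frame-level posteriors from several models (illustrative only).

    posteriors: list of arrays, each of shape (T, K), where T is the number
    of frames and K the number of output labels. Every row is assumed to be
    a probability distribution over the K labels.
    """
    # np.stack raises ValueError if the models disagree on (T, K).
    stacked = np.stack(posteriors, axis=0)  # shape (M, T, K)
    if stacked.ndim != 3:
        raise ValueError("each model output must be a 2-D (T, K) array")
    averaged = stacked.mean(axis=0)  # shape (T, K)
    # Renormalize each frame to guard against floating-point drift.
    return averaged / averaged.sum(axis=1, keepdims=True)


def toy_posteriors(rng, num_frames, num_labels):
    """Random softmax outputs standing in for a trained network (assumption)."""
    logits = rng.normal(size=(num_frames, num_labels))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
model_outputs = [toy_posteriors(rng, num_frames=5, num_labels=4) for _ in range(3)]
ensemble = average_posteriors(model_outputs)
```

Frame-wise averaging of this kind is only meaningful when the models place each label at roughly the same frames, which is consistent with the abstract's remark that good alignments are what enable the ensemble. In a full system the averaged posteriors would then be passed to the HMM decoder in place of a single model's output.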


Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing