On lattice-free boosted MMI training of HMM and CTC-based full-context ASR models

Xiaohui Zhang; Vimal Manohar; David Zhang; Frank Zhang; Yangyang Shi; Nayan Singhal; Julian Chan; Fuchun Peng; Yatharth Saraf; Mike Seltzer

arXiv:2107.04154 · eess.AS · September 28, 2021 · 1 cites

TL;DR

This paper establishes a unified framework for hybrid ASR acoustic modeling that applies LF-MMI across modeling units and label topologies (HMM and CTC), and proposes three training schemes with different advantages in training performance, decoding efficiency, and decoding time-stamp accuracy.

Contribution

It generalizes LF-MMI training to both limited-context and full-context models across wordpiece, mono-char, bi-char, and chenone units, and proposes three new training schemes: ch-CTC-bMMI, wp-CTC-bMMI, and wp-HMM-bMMI.

Findings

01. LF-MMI is effective for both limited-context and full-context models.

02. The proposed schemes offer different advantages in training performance, decoding efficiency, and decoding time-stamp accuracy.

03. Bi-char HMM-MMI models outperform traditional GMM-HMMs as alignment models.

Abstract

Hybrid automatic speech recognition (ASR) models are typically sequentially trained with CTC or LF-MMI criteria. However, they have vastly different legacies and are usually implemented in different frameworks. In this paper, by decoupling the concepts of modeling units and label topologies and building proper numerator/denominator graphs accordingly, we establish a generalized framework for hybrid acoustic modeling (AM). In this framework, we show that LF-MMI is a powerful training criterion applicable to both limited-context and full-context models, for wordpiece/mono-char/bi-char/chenone units, with both HMM/CTC topologies. From this framework, we propose three novel training schemes: chenone(ch)/wordpiece(wp)-CTC-bMMI, and wordpiece(wp)-HMM-bMMI with different advantages in training performance, decoding efficiency and decoding time-stamp accuracy. The advantages of different…
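
For reference, the boosted MMI (bMMI) criterion named in the abstract is commonly written as below. This is a sketch of the standard textbook form, not a formula taken from this page. The symbols are assumptions introduced here for illustration, and the paper's own notation and lattice-free formulation may differ.

```latex
% Standard boosted MMI objective (assumed notation, not the paper's own):
%   X_u      acoustic features of utterance u
%   w_u      reference label sequence of utterance u
%   M_w      model (graph) corresponding to label sequence w
%   \kappa   acoustic scale
%   b >= 0   boosting factor
%   A(w,w_u) accuracy of hypothesis w measured against the reference w_u
\mathcal{F}_{\mathrm{bMMI}}(\theta)
  = \sum_{u} \log
    \frac{p_\theta(X_u \mid \mathcal{M}_{w_u})^{\kappa}\, P(w_u)}
         {\sum_{w} p_\theta(X_u \mid \mathcal{M}_{w})^{\kappa}\, P(w)\,
          \exp\!\bigl(-b\, A(w, w_u)\bigr)}
```

Setting $b = 0$ recovers the plain MMI objective. In a lattice-free setup, the numerator term is typically computed over a numerator graph built from the reference transcript, and the denominator sum over a denominator graph covering all competing label sequences, which is the role the abstract assigns to the numerator/denominator graphs.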

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing