HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch
Tina Raissi, Wei Zhou, Simon Berger, Ralf Schlüter, Hermann Ney

TL;DR
This paper compares from-scratch full-sum training of HMM and CTC for speech recognition, focusing on accuracy and alignment quality, and proposes methods to improve training convergence.
Contribution
It provides a systematic comparison of HMM and CTC trained from scratch, along with novel methods to enhance convergence and alignment quality.
Findings
HMM and CTC achieve comparable accuracy on the Switchboard and LibriSpeech benchmarks.
Proposed methods improve convergence speed and alignment quality.
Detailed analysis of alignment methods informs better ASR system design.
Abstract
In this work, we compare from-scratch sequence-level cross-entropy (full-sum) training of Hidden Markov Model (HMM) and Connectionist Temporal Classification (CTC) topologies for automatic speech recognition (ASR). Besides accuracy, we further analyze their capability for generating high-quality time alignment between the speech signal and the transcription, which can be crucial for many subsequent applications. Moreover, we propose several methods to improve convergence of from-scratch full-sum training by addressing the alignment modeling issue. Systematic comparison is conducted on both Switchboard and LibriSpeech corpora across CTC, posterior HMM with and without transition probabilities, and standard hybrid HMM. We also provide a detailed analysis of both Viterbi forced-alignment and Baum-Welch full-sum occupation probabilities.
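To make the distinction between the full sum, the Viterbi forced alignment, and the Baum-Welch occupation probabilities concrete, here is a minimal sketch for the textbook CTC topology (labels interleaved with an optional blank). It is an illustration under stated assumptions, not the authors' implementation: the function names `ctc_lattice` and `occupation_probs` are hypothetical, the model outputs are assumed to be per-frame log posteriors, and no transition probabilities are modeled. A posterior HMM would use a different label topology.

```python
import math

NEG_INF = float("-inf")

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of log-values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_lattice(log_probs, labels, blank=0, combine=logsumexp):
    """Forward and backward passes over the blank-extended label sequence.

    log_probs[t][k] is the log posterior of symbol k at frame t (assumed input).
    combine=logsumexp yields the full-sum score; combine=max yields the
    score of the single best (Viterbi) alignment.
    Returns (log_score, alpha, beta).
    """
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    T, S = len(log_probs), len(ext)

    def can_skip(s):
        # A blank may be skipped only between two different labels.
        return s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]

    # alpha[t][s]: score of all partial paths ending in state s at frame t.
    alpha = [[NEG_INF] * S for _ in range(T)]
    alpha[0][0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            preds = [alpha[t - 1][s]]
            if s >= 1:
                preds.append(alpha[t - 1][s - 1])
            if can_skip(s):
                preds.append(alpha[t - 1][s - 2])
            alpha[t][s] = combine(preds) + log_probs[t][ext[s]]

    # beta[t][s]: score of all path suffixes after frame t, starting from s.
    beta = [[NEG_INF] * S for _ in range(T)]
    for s in range(max(S - 2, 0), S):
        beta[T - 1][s] = 0.0
    for t in range(T - 2, -1, -1):
        for s in range(S):
            succs = [beta[t + 1][s] + log_probs[t + 1][ext[s]]]
            if s + 1 < S:
                succs.append(beta[t + 1][s + 1] + log_probs[t + 1][ext[s + 1]])
            if s + 2 < S and can_skip(s + 2):
                succs.append(beta[t + 1][s + 2] + log_probs[t + 1][ext[s + 2]])
            beta[t][s] = combine(succs)

    # Valid paths end in the last label or the trailing blank.
    log_score = combine(alpha[T - 1][max(S - 2, 0):])
    return log_score, alpha, beta

def occupation_probs(log_probs, labels, blank=0):
    """Baum-Welch state occupation probabilities, one row per frame."""
    log_z, alpha, beta = ctc_lattice(log_probs, labels, blank)
    return [[math.exp(a + b - log_z) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(alpha, beta)]
```

As a toy check, with two frames, uniform posteriors of 0.5 over {blank, a}, and the transcription "a", the full sum covers three alignments (a a, a blank, blank a) and equals 0.75, while the best single alignment scores 0.25. Each row of `occupation_probs` sums to 1, which is the soft alignment that full-sum training uses in place of a fixed Viterbi alignment.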
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
