HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch

Tina Raissi; Wei Zhou; Simon Berger; Ralf Schlüter; Hermann Ney

arXiv:2210.09951·cs.SD·October 19, 2022·1 cites

Open Access

TL;DR

This paper compares from-scratch full-sum training of HMM and CTC for speech recognition, focusing on accuracy and alignment quality, and proposes methods to improve training convergence.

Contribution

It provides a systematic comparison of HMM and CTC trained from scratch, along with novel methods to enhance convergence and alignment quality.

Findings

01 HMM and CTC achieve comparable accuracy on benchmarks.

02 Proposed methods improve convergence speed and alignment quality.

03 Detailed analysis of alignment methods informs better ASR system design.

Abstract

In this work, we compare from-scratch sequence-level cross-entropy (full-sum) training of Hidden Markov Model (HMM) and Connectionist Temporal Classification (CTC) topologies for automatic speech recognition (ASR). Besides accuracy, we further analyze their capability for generating high-quality time alignments between the speech signal and the transcription, which can be crucial for many subsequent applications. Moreover, we propose several methods to improve the convergence of from-scratch full-sum training by addressing the alignment modeling issue. A systematic comparison is conducted on both the Switchboard and LibriSpeech corpora across CTC, posterior HMM with and without transition probabilities, and standard hybrid HMM. We also provide a detailed analysis of both Viterbi forced alignment and Baum-Welch full-sum occupation probabilities.
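To make the distinction between full-sum and Viterbi scoring concrete, the following is a minimal sketch, not the authors' implementation, that scores a label sequence under the standard CTC topology (blanks interleaved with labels, with loop, forward, and skip-over-blank transitions). The function names `ctc_score` and `logsumexp`, the blank index, and the toy posteriors are all assumptions made for illustration. Passing `logsumexp` as the combiner sums over all alignments (the full-sum quantity), while passing `max` keeps only the best alignment (the Viterbi score).

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-scores."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_score(log_probs, labels, blank=0, combine=logsumexp):
    """Score `labels` against frame-wise log posteriors under a CTC topology.

    log_probs: list of T frames, each a list of log posteriors per symbol.
    combine:   logsumexp -> full-sum over all alignments,
               max       -> Viterbi (single best alignment).
    """
    # Extended state sequence: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    num_states = len(ext)

    # Initialisation: an alignment may start in the first blank or first label.
    prev = [NEG_INF] * num_states
    prev[0] = log_probs[0][blank]
    if num_states > 1:
        prev[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        cur = [NEG_INF] * num_states
        for s in range(num_states):
            candidates = [prev[s]]                    # loop on the same state
            if s >= 1:
                candidates.append(prev[s - 1])        # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(prev[s - 2])        # skip the blank in between
            cur[s] = combine(candidates) + frame[ext[s]]
        prev = cur

    # An alignment may end in the last label or the trailing blank.
    finals = prev[-2:] if num_states > 1 else prev[-1:]
    return combine(finals)


# Toy example: 2 frames, 2 symbols (0 = blank, 1 = a label), uniform posteriors.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
full_sum = math.exp(ctc_score(uniform, [1]))               # 0.75 (three alignments)
viterbi = math.exp(ctc_score(uniform, [1], combine=max))   # 0.25 (best alignment)
```

In the toy example, three alignments are valid (blank-label, label-label, label-blank), each with probability 0.25, so the full-sum score is 0.75 and the Viterbi score is 0.25. An HMM topology would differ in its state set and allowed transitions, which this sketch does not cover.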

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing