A Comparison of Hybrid and End-to-End Models for Syllable Recognition
Sebastian P. Bayerl, Korbinian Riedhammer

TL;DR
This paper compares a hybrid system and an end-to-end model for German syllable recognition on the Verbmobil corpus, and finds that the hybrid system, paired with a strong syllable-based language model, significantly outperforms the end-to-end model.
Contribution
It provides a detailed comparison showing the continued importance of explicit prior knowledge modeling in speech recognition systems.
Findings
The best hybrid system (Kaldi with a 4-gram syllable LM) achieved 10.0% WER on syllables.
The best end-to-end model achieved 27.53% WER on syllables.
The structured hybrid approach outperforms the end-to-end model on syllable recognition.
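The WER figures above are measured over syllables rather than words. A minimal sketch of how such a score is typically computed, assuming the standard Levenshtein-based definition (substitutions + deletions + insertions, divided by the reference length); the function names and the example syllabification are hypothetical and not taken from the paper:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def syllable_error_rate(ref, hyp):
    """Error rate in percent, with syllables as the scoring unit."""
    if not ref:
        raise ValueError("reference must contain at least one syllable")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted syllable out of four.
reference = ["gu", "ten", "mor", "gen"]
hypothesis = ["gu", "ten", "mor", "den"]
print(syllable_error_rate(reference, hypothesis))  # 25.0
```

Because insertions count as errors, this metric can exceed 100% when the hypothesis is much longer than the reference.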
Abstract
This paper compares a traditional hybrid speech recognition system (Kaldi, using WFSTs and a TDNN trained with lattice-free MMI) with a lexicon-free end-to-end model (a TensorFlow implementation of a multi-layer LSTM with CTC training) for German syllable recognition on the Verbmobil corpus. The results show that explicitly modeling prior knowledge is still valuable in building recognition systems. With a strong language model (LM) based on syllables, the structured approach significantly outperforms the end-to-end model. The best word error rate (WER) regarding syllables was achieved using Kaldi with a 4-gram LM, modeling all syllables observed in the training set. It achieved 10.0% WER w.r.t. the syllables, compared to the end-to-end approach where the best WER was 27.53%. The work presented here has implications for building future recognition systems that operate independent of a…
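For readers unfamiliar with CTC, the sketch below illustrates the standard collapse rule that turns per-frame labels into an output sequence: merge consecutive repeats, then drop the blank symbol. It is an illustration only. The summary does not state which output units or decoding procedure the authors used, so the syllable labels, the blank token `<b>`, and the function name are all assumptions.

```python
from itertools import groupby

BLANK = "<b>"  # assumed blank symbol; the actual token is implementation-specific


def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the CTC collapse rule: merge consecutive repeats, then remove blanks."""
    return [label for label, _ in groupby(frame_labels) if label != blank]


# Hypothetical per-frame best labels; the blank between the two "ten" runs
# keeps them as two separate output syllables.
frames = ["gu", "gu", "<b>", "ten", "ten", "<b>", "<b>", "ten"]
print(ctc_collapse(frames))  # ['gu', 'ten', 'ten']
```

Taking the best label per frame and collapsing it this way is greedy decoding with no lexicon or language model, which is the kind of prior knowledge the hybrid system adds explicitly.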
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
