Comparison of Decoding Strategies for CTC Acoustic Models
Thomas Zenkel, Ramon Sanabria, Florian Metze, Jan Niehues, Matthias Sperber, Sebastian Stüker, Alex Waibel

TL;DR
This paper compares three decoding strategies for CTC-based acoustic models in speech recognition, analyzing their performance and error patterns to guide future research.
Contribution
It provides a comprehensive comparison of fixed-vocabulary, neural language model, and sequence-to-sequence decoding methods for CTC acoustic models.
Findings
Weighted Finite State Transducers are effective for fixed-vocabulary decoding.
Neural language models enable open vocabulary recognition with beam search.
Sequence-to-sequence models offer a third option that frames decoding as a translation task.
Abstract
Connectionist Temporal Classification (CTC) has recently attracted a lot of interest as it offers an elegant approach to building acoustic models (AMs) for speech recognition. The CTC loss function maps an input sequence of observable feature vectors to an output sequence of symbols. Output symbols are conditionally independent of each other under CTC loss, so a language model (LM) can be incorporated conveniently during decoding, retaining the traditional separation of acoustic and linguistic components in ASR. For fixed vocabularies, Weighted Finite State Transducers provide a strong baseline for efficient integration of CTC AMs with n-gram LMs. Character-based neural LMs provide a straightforward solution for open vocabulary speech recognition and all-neural models, and can be decoded with beam search. Finally, sequence-to-sequence models can be used to translate a sequence of individual…
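
To make the mapping from frame-level outputs to a symbol sequence concrete, here is a minimal sketch of greedy (best-path) CTC decoding. It is not one of the three strategies the paper compares; it is only the simplest LM-free baseline, shown to illustrate how repeated symbols and blanks are collapsed. The blank symbol `_`, the function names, and the toy alphabet are assumptions made for this illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_collapse(path, blank=BLANK):
    """Collapse a frame-level CTC path: merge adjacent repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != blank)


def greedy_decode(frame_probs, alphabet, blank=BLANK):
    """Take the most probable symbol in each frame, then collapse the path."""
    path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    return ctc_collapse(path, blank)


# Toy usage: a blank between the two "l" runs keeps the double letter.
print(ctc_collapse("hh_e_ll_lo__"))
```

Because each frame is decided independently, this baseline uses no linguistic knowledge at all; the strategies described in the abstract differ in how they bring a language model into that search.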
Taxonomy
Methods: Connectionist Temporal Classification Loss
