Building competitive direct acoustics-to-word models for English conversational speech recognition
Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, Michael Picheny

TL;DR
This paper presents a training recipe for direct acoustics-to-word (A2W) speech recognition models that brings their accuracy on par with state-of-the-art sub-word models, without using a decoder, pronunciation lexicon, or externally trained language model.
Contribution
It introduces a training recipe that significantly improves A2W model performance and proposes a joint word-character model for better handling of unseen words.
Findings
Achieved 8.8% WER on the Hub5-2000 Switchboard test set (13.9% on CallHome) without a decoder or language model
Identified initialization and training data order as key factors affecting performance
Proposed a joint word-character model for improved recognition of rare words
Abstract
Direct acoustics-to-word (A2W) models in the end-to-end paradigm have received increasing attention compared to conventional sub-word based automatic speech recognition models using phones, characters, or context-dependent hidden Markov model states. This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple. Prior work has shown that A2W models require orders of magnitude more training data in order to perform comparably to conventional models. Our work also showed this accuracy gap when using the English Switchboard-Fisher data set. This paper describes a recipe to train an A2W model that closes this gap and is at-par with state-of-the-art sub-word based models. We achieve a word error rate of 8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets…
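The word error rate (WER) figures quoted above follow the standard definition: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below illustrates that definition only. It is not the scoring tool used in the paper, and the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization is plain whitespace splitting, with no
    text normalization of the kind real scoring pipelines apply.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("b" -> "x") and one deletion ("d") over 4 reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 8.8% corresponds to roughly 9 word errors per 100 reference words. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.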
