Direct Acoustics-to-Word Models for English Conversational Speech Recognition
Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, David Nahamoo

TL;DR
This paper demonstrates direct acoustics-to-word CTC models for English conversational speech recognition, achieving competitive accuracy without the need for an external language model or decoder, and discusses techniques to address data requirements.
Contribution
First reported results for direct acoustics-to-word CTC models on two public benchmark tasks (Switchboard and CallHome), removing the need for a separate language model and decoder at run time.
Findings
Achieved a 13.0% word error rate (WER) on Switchboard without an LM or decoder.
Presented techniques to mitigate large data requirements for word models.
Compared performance of word and phone CTC models.
Abstract
Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and separately-trained Language Model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some…
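To illustrate what recognition without an LM or decoder can look like, here is a minimal sketch of greedy (best-path) CTC decoding over word outputs: take the top-scoring label in each frame, merge consecutive repeats, and drop the blank symbol. This is the standard CTC collapsing rule, not necessarily the authors' exact post-processing. The function name `greedy_ctc_decode`, the blank index 0, and the toy vocabulary and posteriors are assumptions made for the example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def greedy_ctc_decode(frame_posteriors, vocab, blank=BLANK):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Index of the highest-scoring output unit in each frame.
    best = [max(range(len(frame)), key=frame.__getitem__)
            for frame in frame_posteriors]
    # Merge runs of identical consecutive labels into a single label.
    collapsed = [label for label, _ in groupby(best)]
    # Remove blanks and map the remaining indices to words.
    return [vocab[label] for label in collapsed if label != blank]


# Toy example (hypothetical 3-unit vocabulary, 5 frames).
vocab = ["<blank>", "hello", "world"]
posteriors = [
    [0.10, 0.80, 0.10],  # hello
    [0.20, 0.70, 0.10],  # hello (repeat, merged)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # world
    [0.60, 0.20, 0.20],  # blank
]
words = greedy_ctc_decode(posteriors, vocab)
print(" ".join(words))  # prints "hello world"
```

Because the output units are whole words, the collapsed label sequence is already the transcript; no dictionary lookup or search over an LM is involved. A word repeated in speech survives collapsing only when a blank frame separates its two occurrences.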
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
