Building competitive direct acoustics-to-word models for English conversational speech recognition

Kartik Audhkhasi; Brian Kingsbury; Bhuvana Ramabhadran; George Saon; Michael Picheny

arXiv:1712.03133·cs.CL·December 11, 2017


TL;DR

This paper presents a training recipe for direct acoustics-to-word (A2W) speech recognition models that brings their accuracy on par with state-of-the-art sub-word models, without using a decoder, pronunciation lexicon, or externally trained language model.

Contribution

It introduces a training recipe that significantly improves A2W model performance and proposes a joint word-character model for better handling of unseen words.

Findings

01

Achieved 8.8%/13.9% WER on the Hub5-2000 Switchboard/CallHome test sets without a decoder or externally trained language model

02

Identified initialization and data order as key factors affecting performance

03

Proposed a joint word-character model for improved recognition of rare words

Abstract

Direct acoustics-to-word (A2W) models in the end-to-end paradigm have received increasing attention compared to conventional sub-word based automatic speech recognition models using phones, characters, or context-dependent hidden Markov model states. This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple. Prior work has shown that A2W models require orders of magnitude more training data in order to perform comparably to conventional models. Our work also showed this accuracy gap when using the English Switchboard-Fisher data set. This paper describes a recipe to train an A2W model that closes this gap and is at-par with state-of-the-art sub-word based models. We achieve a word error rate of 8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets…
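
For readers unfamiliar with the metric quoted above, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a hypothesis (substitutions, deletions, and insertions), divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in the paper; the function name `word_error_rate` is hypothetical, and it assumes whitespace tokenization with none of the text normalization that official evaluations typically apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.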
