Acoustically Grounded Word Embeddings for Improved Acoustics-to-Word Speech Recognition
Shane Settle, Kartik Audhkhasi, Karen Livescu, Michael Picheny

TL;DR
This paper applies recently proposed acoustically grounded word embeddings to end-to-end acoustics-to-word (A2W) speech recognition, targeting two weaknesses of A2W systems: training with limited data and recognizing out-of-vocabulary words.
Contribution
It proposes integrating acoustically grounded word embeddings into A2W systems in two ways: enforcing similarity between the external embeddings and the recognizer weights at training time, and using the embeddings at test time to predict out-of-vocabulary words.
Findings
Improved recognition accuracy on conversational telephone speech.
Enhanced handling of out-of-vocabulary words.
Demonstrated benefits of acoustically grounded embeddings in A2W systems.
Abstract
Direct acoustics-to-word (A2W) systems for end-to-end automatic speech recognition are simpler to train, and more efficient to decode with, than sub-word systems. However, A2W systems can have difficulties at training time when data is limited, and at decoding time when recognizing words outside the training vocabulary. To address these shortcomings, we investigate the use of recently proposed acoustic and acoustically grounded word embedding techniques in A2W systems. The idea is based on treating the final pre-softmax weight matrix of an A2W recognizer as a matrix of word embedding vectors, and using an externally trained set of word embeddings to improve the quality of this matrix. In particular, we introduce two ideas: (1) enforcing similarity at training time between the external embeddings and the recognizer weights, and (2) using the word embeddings at test time for predicting…
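The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the squared-distance penalty, the weight `lam`, and the helper names `similarity_penalty`, `regularized_loss`, and `extend_vocabulary` are all assumptions chosen for illustration, and the paper may use a different similarity measure or a different out-of-vocabulary procedure.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def similarity_penalty(W, E):
    """Mean squared L2 distance between recognizer weight rows (W)
    and external word embeddings (E). Assumed form, for illustration."""
    return float(np.mean(np.sum((W - E) ** 2, axis=1)))

def regularized_loss(h, target, W, E, lam=0.1):
    """Cross-entropy on one frame plus a penalty that pulls the
    pre-softmax weight matrix toward the external embeddings.
    h: hidden vector (d,), W and E: (vocab, d), lam: hypothetical weight."""
    logits = W @ h  # each row of W acts as a word embedding
    ce = -np.log(softmax(logits)[target])
    return float(ce + lam * similarity_penalty(W, E))

def extend_vocabulary(W, E_oov):
    """Append external embeddings of unseen words as extra output rows,
    so the recognizer can score words outside its training vocabulary."""
    return np.vstack([W, E_oov])
```

The design choice being illustrated is that, once each row of the pre-softmax matrix is read as a word embedding, an external embedding table can both regularize those rows during training and supply rows for words the recognizer never saw.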
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
