Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings
Bowen Shi, Shane Settle, Karen Livescu

TL;DR
This paper introduces an efficient end-to-end whole-word segmental speech recognition model built on acoustic word embeddings, and shows that pre-training reduces word error rate (WER).
Contribution
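The paper presents an efficient approach to whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU, and explores pre-training with acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) for improved accuracy.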
Findings
Pre-training with AWEs reduces word error rate significantly.
Pre-training with AGWEs provides additional accuracy gains.
The proposed models outperform prior acoustic-to-word models.
Abstract
Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can…
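The segmental decoding described in the abstract can be sketched as a dynamic program over segment boundaries. The following is a minimal NumPy sketch, not the authors' implementation: the function name `segmental_viterbi`, the mean-pooling segment embedding, and the dot-product score against a word embedding matrix are all assumptions made for illustration, and it runs on the CPU rather than a GPU.

```python
import numpy as np


def segmental_viterbi(frames, word_embs, max_len, embed_segment=None):
    """Find the highest-scoring segmentation of `frames` into whole words.

    frames:    (T, d) array of acoustic feature vectors.
    word_embs: (V, d) array, one (hypothetical) embedding per vocabulary word.
    max_len:   maximum segment length in frames.
    Returns (total score, list of (start, end, word_id) segments).
    """
    if embed_segment is None:
        # Assumption: mean pooling stands in for a learned acoustic word embedding.
        def embed_segment(segment):
            return segment.mean(axis=0)

    T = frames.shape[0]
    best = np.full(T + 1, -np.inf)  # best[t]: best score over segmentations of frames[:t]
    best[0] = 0.0
    back = [None] * (T + 1)         # back[t]: (start, word_id) of the last segment ending at t

    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            # Assumed scoring function: dot product of the segment embedding
            # with every word embedding, giving one score per vocabulary word.
            scores = word_embs @ embed_segment(frames[s:t])
            v = int(np.argmax(scores))
            candidate = best[s] + scores[v]
            if candidate > best[t]:
                best[t] = candidate
                back[t] = (s, v)

    # Follow back-pointers from the last frame to recover the segmentation.
    segments = []
    t = T
    while t > 0:
        s, v = back[t]
        segments.append((s, t, v))
        t = s
    return float(best[T]), segments[::-1]
```

Each step of the inner loop scores one candidate segment against all V words, so the work grows as T × max_len × V. That is the dependence on vocabulary size the abstract identifies as the computational challenge for whole-word models.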
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
