Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings

Bowen Shi; Shane Settle; Karen Livescu

arXiv:2007.00183·eess.AS·November 25, 2020

TL;DR

This paper introduces an efficient end-to-end whole-word segmental speech recognition model using acoustic word embeddings, demonstrating significant WER improvements through pre-training techniques.

Contribution

It presents a GPU-accelerated approach for whole-word segmental models and explores pre-training with acoustic and grounded word embeddings for improved accuracy.

Findings

01. Pre-training with AWEs reduces word error rate significantly.

02. Pre-training with AGWEs provides additional accuracy gains.

03. The proposed models outperform prior acoustic-to-word models.

Abstract

Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can…
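To make the abstract's description of segmental decoding concrete, here is a minimal CPU sketch of Viterbi decoding over whole-word segmentations. It is an illustration only, not the authors' implementation: the paper runs forward-backward and Viterbi on a GPU and scores segments with embeddings, whereas the function names (`segmental_viterbi`, `toy_score`), the scalar "frames", the word prototypes, and the per-segment penalty below are all hypothetical choices made for this example. The inner loop over the vocabulary shows why the cost grows with vocabulary size.

```python
from typing import Callable, List, Optional, Tuple

Segment = Tuple[int, int, str]  # (start frame, end frame exclusive, word label)


def segmental_viterbi(
    num_frames: int,
    vocab: List[str],
    max_len: int,
    score: Callable[[int, int, str], float],
) -> Tuple[float, List[Segment]]:
    """Return the best total score and the best labeled segmentation.

    Assumes score(start, end, word) is finite and max_len >= 1.
    """
    best = [float("-inf")] * (num_frames + 1)  # best[t]: best score covering frames [0, t)
    back: List[Optional[Tuple[int, str]]] = [None] * (num_frames + 1)
    best[0] = 0.0

    for end in range(1, num_frames + 1):
        # Consider every segment [start, end) of length at most max_len.
        for start in range(max(0, end - max_len), end):
            # Every word in the vocabulary is a candidate label for the segment.
            for word in vocab:
                cand = best[start] + score(start, end, word)
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, word)

    # Follow back-pointers from the last frame to recover the segmentation.
    segments: List[Segment] = []
    end = num_frames
    while end > 0:
        start, word = back[end]
        segments.append((start, end, word))
        end = start
    segments.reverse()
    return best[num_frames], segments


# Toy setup (hypothetical): scalar "frames" and one scalar prototype per word.
frames = [0.0, 0.1, 1.0, 0.9, 1.1]
prototypes = {"a": 0.0, "b": 1.0}
SEGMENT_PENALTY = 0.5  # discourages splitting into many short segments


def toy_score(start: int, end: int, word: str) -> float:
    """Negative squared distance of the segment's frames to the word prototype."""
    dist = sum((x - prototypes[word]) ** 2 for x in frames[start:end])
    return -dist - SEGMENT_PENALTY


total, segments = segmental_viterbi(len(frames), list(prototypes), 4, toy_score)
print(total, segments)
```

On this toy input the decoder splits the frames into one segment labeled "a" followed by one labeled "b". The runtime is proportional to the number of frames times the maximum segment length times the vocabulary size, which is the cost the abstract refers to when comparing whole-word vocabularies with subword units such as phones.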


Code & Models

Repositories

chevalierNoir/A2W-Segmental (PyTorch, official)


Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing