Improvements to Embedding-Matching Acoustic-to-Word ASR Using Multiple-Hypothesis Pronunciation-Based Embeddings

Hao Yen; Woojay Jeon

arXiv:2210.16726·eess.AS·February 21, 2023

TL;DR

This paper improves embedding-matching acoustic-to-word (A2W) ASR by producing multiple hypothesis embeddings at each time instance and by using pronunciation-based embeddings. The gains are largest for dynamic out-of-vocabulary (OOV) words, such as contact names, in digital assistant queries.

Contribution

The paper introduces two methods to improve embedding-matching A2W accuracy: generating multiple embeddings, rather than a single one, at each time instance, and using pronunciation-based embeddings.

Findings

01. Up to 18% reduction in word error rate on contact name queries.

02. Significant accuracy improvements with the same training data and model size.

03. Effective handling of dynamic OOV words in real-world scenarios.

Abstract

In embedding-matching acoustic-to-word (A2W) ASR, every word in the vocabulary is represented by a fixed-dimension embedding vector that can be added or removed independently of the rest of the system. The approach is potentially an elegant solution for the dynamic out-of-vocabulary (OOV) words problem, where speaker- and context-dependent named entities like contact names must be incorporated into the ASR on-the-fly for every speech utterance at testing time. Challenges still remain, however, in improving the overall accuracy of embedding-matching A2W. In this paper, we contribute two methods that improve the accuracy of embedding-matching A2W. First, we propose internally producing multiple embeddings, instead of a single embedding, at each instance in time, which allows the A2W model to propose a richer set of hypotheses over multiple time segments in the audio. Second, we propose…
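The matching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity function, the embedding dimension, or how multiple embeddings are combined, so cosine similarity, the 3-dimensional toy vectors, max-pooling over hypotheses, the `EmbeddingVocabulary` class, and the contact name `anneke` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class EmbeddingVocabulary:
    """Hypothetical word table: each word maps to one fixed-dimension vector."""

    def __init__(self):
        self._table = {}

    def add(self, word, embedding):
        # Adding a word touches only its own entry, not the rest of the system.
        self._table[word] = list(embedding)

    def remove(self, word):
        self._table.pop(word, None)

    def best_match(self, acoustic_embeddings):
        """Return (word, score) using each word's best score over all hypotheses."""
        best_word, best_score = None, -math.inf
        for word, vector in self._table.items():
            # Max over hypotheses is an assumed way to combine multiple embeddings.
            score = max(cosine(a, vector) for a in acoustic_embeddings)
            if score > best_score:
                best_word, best_score = word, score
        return best_word, best_score


vocab = EmbeddingVocabulary()
vocab.add("call", [1.0, 0.0, 0.0])
vocab.add("text", [0.0, 1.0, 0.0])

# Two toy hypothesis embeddings produced for the same instant in the audio.
hypotheses = [[0.1, 0.2, 0.9], [0.3, 0.1, 0.2]]

before = vocab.best_match(hypotheses)
vocab.add("anneke", [0.0, 0.0, 1.0])  # contact name added on the fly
after = vocab.best_match(hypotheses)

print(before[0], after[0])
```

In this toy setup the best match changes from `call` to `anneke` once the contact name is added, and removing the entry restores the earlier result, with no change to the scoring code. That mirrors the property the abstract describes, in which vocabulary entries are added or removed independently of the rest of the system.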

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing