Improvements to Embedding-Matching Acoustic-to-Word ASR Using Multiple-Hypothesis Pronunciation-Based Embeddings
Hao Yen, Woojay Jeon

TL;DR
This paper improves embedding-matching acoustic-to-word (A2W) ASR by producing multiple embedding hypotheses at each time instance and by using pronunciation-based embeddings, raising accuracy on dynamic out-of-vocabulary (OOV) words such as contact names in digital assistant queries.
Contribution
The paper introduces two methods to improve embedding-matching A2W accuracy: generating multiple embeddings at each instance in time, and using pronunciation-based embeddings.
Findings
Up to 18% reduction in word error rate on contact name queries.
Significant accuracy improvements with the same training data and model size.
Effective handling of dynamic OOV words in real-world scenarios.
Abstract
In embedding-matching acoustic-to-word (A2W) ASR, every word in the vocabulary is represented by a fixed-dimension embedding vector that can be added or removed independently of the rest of the system. The approach is potentially an elegant solution for the dynamic out-of-vocabulary (OOV) words problem, where speaker- and context-dependent named entities like contact names must be incorporated into the ASR on-the-fly for every speech utterance at testing time. Challenges still remain, however, in improving the overall accuracy of embedding-matching A2W. In this paper, we contribute two methods that improve the accuracy of embedding-matching A2W. First, we propose internally producing multiple embeddings, instead of a single embedding, at each instance in time, which allows the A2W model to propose a richer set of hypotheses over multiple time segments in the audio. Second, we propose…
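The matching idea in the abstract can be illustrated with a toy sketch. It is not the authors' model: the class name `EmbeddingVocabulary`, the cosine-similarity scoring, the rule of keeping the best score over several hypothesis embeddings, and the three-dimensional vectors are all assumptions made for illustration. The sketch only shows that a word can be added to or removed from the vocabulary table without touching anything else, and that several embeddings for one time instance give the matcher more than one chance to find the right word.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EmbeddingVocabulary:
    """Hypothetical word table: one fixed-dimension embedding per word."""

    def __init__(self):
        self._table = {}

    def add(self, word, embedding):
        # Adding a word only touches this table, not the acoustic model.
        self._table[word] = list(embedding)

    def remove(self, word):
        self._table.pop(word, None)

    def match(self, acoustic_embeddings):
        """Return (word, score) for the best match over all hypothesis embeddings.

        Each word is scored against every hypothesis embedding and keeps its
        best score (an assumed combination rule).
        """
        best_word, best_score = None, -math.inf
        for word, word_embedding in self._table.items():
            score = max(cosine(a, word_embedding) for a in acoustic_embeddings)
            if score > best_score:
                best_word, best_score = word, score
        return best_word, best_score


# Made-up vectors standing in for model outputs at one time instance.
vocab = EmbeddingVocabulary()
vocab.add("call", [1.0, 0.0, 0.0])
vocab.add("text", [0.0, 1.0, 0.0])
hypotheses = [[0.1, 0.2, 0.9], [0.6, 0.5, 0.1]]

before = vocab.match(hypotheses)     # contact name not yet in the vocabulary
vocab.add("Aoife", [0.0, 0.1, 1.0])  # dynamic OOV word added on the fly
after = vocab.match(hypotheses)
```

With these made-up vectors, the best match changes from "call" to "Aoife" as soon as the contact name is added, and removing it restores the earlier result.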
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
