Acoustic-To-Word Model Without OOV
Jinyu Li, Guoli Ye, Rui Zhao, Jasha Droppo, Yifan Gong

TL;DR
This paper presents a hybrid CTC model that predicts both words and characters to address the out-of-vocabulary issue in acoustic-to-word speech recognition, improving accuracy on a voice assistant task.
Contribution
The study introduces a hybrid CTC model with synchronized word and character outputs that effectively handles OOV and hot-words in end-to-end speech recognition.
Findings
Reduces OOV-related errors by 30% on a Microsoft Cortana task.
Enables recognition of hot-words emerging after training.
Improves end-to-end speech recognition accuracy.
Abstract
Recently, the acoustic-to-word model based on the Connectionist Temporal Classification (CTC) criterion was shown to be a natural end-to-end model that directly targets words as output units. However, this type of word-based CTC model suffers from the out-of-vocabulary (OOV) issue, as it can only model a limited number of words in the output layer and maps all remaining words to an OOV output node. Therefore, such a word-based CTC model can only recognize the frequent words modeled by the network output nodes. It also cannot easily handle hot-words that emerge after the model is trained. In this study, we improve the acoustic-to-word model with a hybrid CTC model that can predict both words and characters at the same time. With a shared-hidden-layer structure and modular design, the alignments of words generated from the word-based CTC and the character-based CTC are synchronized.…
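The abstract is cut off here, so the excerpt does not spell out how the synchronized alignments are used. The following is a minimal sketch, not the authors' implementation, of one plausible reading: decode the word-based CTC output greedily and, wherever it emits the OOV node, substitute the characters that the character-based CTC emits over the same frames. The symbols `BLANK` and `OOV`, the function names, the toy frame labels, and the rule that each word owns the frames since the previous word ended are all assumptions made for illustration.

```python
BLANK = "_"      # CTC blank symbol (hypothetical choice for this sketch)
OOV = "<OOV>"    # the single output node that absorbs all unmodeled words


def ctc_collapse(frames):
    """Greedy CTC decoding: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for label in frames:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out


def word_spans(word_frames):
    """Return (word, start, end) for each emitted word, frames [start, end).

    Assumption: a word owns the frames from the end of the previous word's
    run up to the end of its own run. This is a simplification.
    """
    spans, prev, start = [], None, 0
    for t, label in enumerate(word_frames):
        if label != prev and prev not in (None, BLANK):
            spans.append((prev, start, t))  # the previous run ended at t
            start = t
        prev = label
    if prev not in (None, BLANK):
        spans.append((prev, start, len(word_frames)))
    return spans


def hybrid_decode(word_frames, char_frames):
    """Keep in-vocabulary words; spell OOV segments from the character stream."""
    if len(word_frames) != len(char_frames):
        raise ValueError("word and character streams must be frame-synchronized")
    words = []
    for word, start, end in word_spans(word_frames):
        if word == OOV:
            spelled = "".join(ctc_collapse(char_frames[start:end])).strip()
            words.append(spelled or OOV)  # fall back to OOV if nothing was spelled
        else:
            words.append(word)
    return words


# Toy per-frame argmax labels from the two output layers (invented data).
word_frames = ["_", "hi", "_", "_", "_", "_", "_", "<OOV>", "_", "_", "ok", "_"]
char_frames = ["h", "i", " ", "_", "z", "z", "e", "d", " ", "o", "k", "_"]

print(hybrid_decode(word_frames, char_frames))  # ['hi', 'zed', 'ok']
```

The fallback only works if both outputs are aligned to the same frames, which is the property the shared-hidden-layer design is said to provide. Real word-level CTC spikes need not line up this neatly with character boundaries, so a practical system would need a more careful segment-matching rule than the one assumed here.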
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
