Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework
Xiaohui Zhang, Vimal Manohar, Daniel Povey, Sanjeev Khudanpur

TL;DR
This paper introduces a data-driven method that automatically learns pronunciations for words that have transcribed audio but no lexicon entry. It combines evidence from the letter sequence and from the acoustics, then prunes the candidate pronunciations, since lexicons with too many entries tend to hurt ASR performance.
Contribution
It presents a greedy pronunciation selection framework that automatically builds a compact lexicon from transcribed data and outperforms lexicons built using G2P alone.
Findings
Starting from an initial lexicon of several thousand words, the learned lexicon achieves test-set WER close to that of a full expert lexicon
Outperforms lexicons built using G2P alone
Effective pruning reduces lexicon size without sacrificing quality
Abstract
Speech recognition systems for irregularly-spelled languages like English normally require hand-written pronunciations. In this paper, we describe a system for automatically obtaining pronunciations of words for which pronunciations are not available, but for which transcribed data exists. Our method integrates information from the letter sequence and from the acoustic evidence. The novel aspect of the problem that we address is the problem of how to prune entries from such a lexicon (since, empirically, lexicons with too many entries do not tend to be good for ASR performance). Experiments on various ASR tasks show that, with the proposed framework, starting with an initial lexicon of several thousand words, we are able to learn a lexicon which performs close to a full expert lexicon in terms of WER performance on test data, and is better than lexicons built using G2P alone or with a…
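To make the idea of greedy pronunciation pruning concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not spell out the selection criterion, so the scoring scheme (a per-utterance log-likelihood for each candidate pronunciation), the function name `greedy_select`, and the parameters `min_gain` and `max_prons` are all assumptions made for illustration. The sketch keeps adding the candidate that most improves how well the word's utterances are explained, and stops once the improvement becomes negligible.

```python
from typing import Dict, List


def greedy_select(
    scores: Dict[str, List[float]],
    min_gain: float = 1.0,   # hypothetical stopping threshold (log-likelihood units)
    max_prons: int = 3,      # hypothetical cap on pronunciations per word
) -> List[str]:
    """Greedily pick pronunciations for one word.

    scores[pron][i] is an assumed log-likelihood of utterance i when the
    word is pronounced as `pron` (e.g. from a forced alignment).
    """
    candidates = dict(scores)
    if not candidates:
        return []

    # Seed with the single candidate that best explains all utterances.
    first = max(candidates, key=lambda p: sum(candidates[p]))
    selected = [first]
    best = list(candidates.pop(first))  # best score so far for each utterance

    while candidates and len(selected) < max_prons:
        # Gain of a candidate: total improvement over the current best scores.
        gains = {
            p: sum(max(0.0, s - b) for s, b in zip(v, best))
            for p, v in candidates.items()
        }
        pron = max(gains, key=gains.get)
        if gains[pron] < min_gain:
            break  # remaining candidates add almost nothing: prune them
        selected.append(pron)
        best = [max(s, b) for s, b in zip(candidates.pop(pron), best)]

    return selected


# Made-up scores for the word "either" over four utterances.
example_scores = {
    "IY DH ER": [-10.0, -11.0, -30.0, -28.0],
    "AY DH ER": [-28.0, -30.0, -12.0, -10.0],
    "IY TH ER": [-10.5, -11.2, -31.0, -30.0],
}
print(greedy_select(example_scores))
```

In this toy input, two candidates each explain a different subset of the utterances well, while the third is nearly redundant with the first, so it is pruned. A real system would derive the candidates from both G2P and acoustic decoding, as the abstract describes.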
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
