Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

W. Ronny Huang; Cal Peyser; Tara N. Sainath; Ruoming Pang; Trevor Strohman; Shankar Kumar

arXiv:2203.05008·cs.CL·June 16, 2022

TL;DR

This paper introduces three data selection strategies for language models to improve rare-word speech recognition, significantly reducing data size and enhancing performance without harming overall accuracy.

Contribution

The paper proposes three simple, effective data selection methods (downsampling, rare-word filtering, and domain matching) to improve rare-word recognition in speech systems.

Findings

1. 53x data reduction with improved perplexities
2. Up to 24% WER reduction on rare-word sentences
3. Favorable live voice search evaluation results

Abstract

Language model fusion helps smart assistants recognize words that are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather"). We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high-frequency (head) sentences. Second, to encourage rare-word exposure, we explicitly filter for words rare in the acoustic data. Finally, we tackle domain mismatch via perplexity-based contrastive selection,…
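The first two strategies in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact soft log function or the rarity criterion, so everything here is an assumption made for illustration: the function names, the `threshold` and `max_count` parameters, and the specific formula (counts up to `threshold` are kept, larger counts grow only logarithmically) are hypothetical and may differ from what the authors use.

```python
import math
from collections import Counter


def soft_log_downsample(sentence_counts, threshold=10.0):
    """Reduce the counts of high-frequency (head) sentences.

    Hypothetical soft log rule, NOT taken from the paper:
    counts at or below `threshold` are kept unchanged; counts above it
    are mapped to threshold * (1 + ln(count / threshold)).
    """
    reduced = {}
    for sentence, count in sentence_counts.items():
        if count <= threshold:
            new_count = count
        else:
            new_count = threshold * (1.0 + math.log(count / threshold))
        # Keep at least one copy of every sentence.
        reduced[sentence] = max(1, round(new_count))
    return reduced


def filter_rare_word_sentences(sentences, acoustic_word_counts, max_count=5):
    """Keep sentences containing at least one word that is rare in the acoustic data.

    "Rare" is assumed here to mean the word occurs at most `max_count`
    times in the acoustic transcripts (a hypothetical cutoff).
    """
    return [
        s for s in sentences
        if any(acoustic_word_counts.get(w, 0) <= max_count for w in s.split())
    ]


# Toy data: one heavily duplicated query and one infrequent query.
text_counts = {"weather": 1000, "play zydeco": 3}
reduced = soft_log_downsample(text_counts, threshold=10.0)

# Toy word counts standing in for the acoustic (audio-text paired) data.
acoustic_counts = Counter({"weather": 500, "today": 300, "play": 400})
kept = filter_rare_word_sentences(
    ["weather today", "play zydeco"], acoustic_counts, max_count=5
)
```

With this rule, the duplicated head query shrinks sharply while the infrequent query is left untouched, and only the sentence containing a word unseen in the toy acoustic counts survives the filter. The `threshold` parameter plays the role of the tunable knob the abstract mentions.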

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques