Optimal size, freshness and time-frame for voice search vocabulary
Maryam Kamvar, Ciprian Chelba

TL;DR
This study determines the vocabulary size and data freshness needed to minimize out-of-vocabulary (OoV) rates in voice search. It finds that a vocabulary of 2 to 2.5 million words drawn from one week of search query data yields an aggregate OoV rate of 1%, and that the same rate is also experienced by 90% of users.
Contribution
It introduces a method to optimize voice search vocabulary size based on user experience metrics and analyzes the impact of data freshness and window size on OoV rates.
Findings
A vocabulary of 2 to 2.5 million words yields an aggregate OoV rate of 1%, a rate also experienced by 90% of users
Vocabulary size is a stable indicator of OoV rate
Data freshness and window size have minimal impact on OoV rate
Abstract
In this paper, we investigate how to optimize the vocabulary for a voice search language model. The metric we optimize over is the out-of-vocabulary (OoV) rate since it is a strong indicator of user experience. In a departure from the usual way of measuring OoV rates, web search logs allow us to compute the per-session OoV rate and thus estimate the percentage of users that experience a given OoV rate. Under very conservative text normalization, we find that a voice search vocabulary consisting of 2 to 2.5 million words extracted from 1 week of search query data will result in an aggregate OoV rate of 1%; at that size, the same OoV rate will also be experienced by 90% of users. The number of words included in the vocabulary is a stable indicator of the OoV rate. Altering the freshness of the vocabulary or the duration of the time window over which the training data is gathered does not…
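The difference between the aggregate OoV rate and the per-session OoV rate described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' procedure: whitespace tokenization stands in for the paper's text normalization, the vocabulary is taken to be the most frequent training words, sessions stand in for users, and the function names and toy data are hypothetical.

```python
from collections import Counter


def build_vocabulary(training_queries, size):
    """Keep the `size` most frequent words (assumed selection rule)."""
    counts = Counter(word for query in training_queries for word in query.split())
    return {word for word, _ in counts.most_common(size)}


def oov_rate(queries, vocab):
    """Fraction of word tokens in `queries` that are missing from `vocab`."""
    words = [word for query in queries for word in query.split()]
    if not words:
        return 0.0
    return sum(word not in vocab for word in words) / len(words)


def per_session_oov(sessions, vocab):
    """Map each session id to the OoV rate over that session's queries."""
    return {sid: oov_rate(queries, vocab) for sid, queries in sessions.items()}


def fraction_at_or_below(session_rates, threshold):
    """Share of sessions whose OoV rate does not exceed `threshold`."""
    if not session_rates:
        return 0.0
    hits = sum(rate <= threshold for rate in session_rates.values())
    return hits / len(session_rates)


# Hypothetical toy data: training queries and two test sessions.
training = ["weather in new york", "new york pizza", "weather tomorrow"]
sessions = {
    "s1": ["weather in new york"],
    "s2": ["pizza tomorrow", "zyzzyva recipes"],
}

vocab = build_vocabulary(training, size=10)
all_queries = [query for queries in sessions.values() for query in queries]
aggregate = oov_rate(all_queries, vocab)        # pooled over all test words
rates = per_session_oov(sessions, vocab)        # one rate per session
share_ok = fraction_at_or_below(rates, 0.01)    # sessions at or below 1% OoV
```

In this toy setup the aggregate rate pools every test word, while the per-session view shows that one session sees no OoV words and the other sees many. That gap is what the per-session measurement is meant to expose.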
Taxonomy
Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques
