Optimal size, freshness and time-frame for voice search vocabulary

Maryam Kamvar; Ciprian Chelba

arXiv:1210.8436 · cs.CL · November 1, 2012 · 1 citation

Open Access

TL;DR

This study determines the vocabulary size and data freshness needed to minimize out-of-vocabulary (OoV) rates in voice search. A vocabulary of 2 to 2.5 million words extracted from one week of search query data yields an aggregate OoV rate of 1%, and 90% of users experience that same rate.

Contribution

It optimizes the voice search vocabulary against a user-experience metric, the per-session OoV rate computed from web search logs, and analyzes how vocabulary freshness and the duration of the training data window affect OoV rates.

Findings

01. 2-2.5 million words achieve 1% OoV rate for 90% of users

02. Vocabulary size is a stable indicator of OoV rate

03. Data freshness and window size have minimal impact on OoV rate

Abstract

In this paper, we investigate how to optimize the vocabulary for a voice search language model. The metric we optimize over is the out-of-vocabulary (OoV) rate since it is a strong indicator of user experience. In a departure from the usual way of measuring OoV rates, web search logs allow us to compute the per-session OoV rate and thus estimate the percentage of users that experience a given OoV rate. Under very conservative text normalization, we find that a voice search vocabulary consisting of 2 to 2.5 million words extracted from 1 week of search query data will result in an aggregate OoV rate of 1%; at that size, the same OoV rate will also be experienced by 90% of users. The number of words included in the vocabulary is a stable indicator of the OoV rate. Altering the freshness of the vocabulary or the duration of the time window over which the training data is gathered does not…
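
The two measurements named in the abstract, the aggregate OoV rate and the share of sessions that experience a given OoV rate, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names are hypothetical, words are split on whitespace as a stand-in for the paper's text normalization, a session is modeled as a list of query strings keyed by a session ID, and "experiencing a given OoV rate" is read as having a per-session rate at or below a threshold.

```python
from collections import Counter


def build_vocabulary(training_queries, size):
    """Keep the `size` most frequent words (assumed: whitespace tokenization)."""
    counts = Counter(word for query in training_queries for word in query.split())
    return {word for word, _ in counts.most_common(size)}


def oov_rate(queries, vocabulary):
    """Fraction of word tokens in `queries` that are missing from `vocabulary`."""
    words = [word for query in queries for word in query.split()]
    if not words:
        return 0.0
    return sum(word not in vocabulary for word in words) / len(words)


def aggregate_oov_rate(sessions, vocabulary):
    """OoV rate over all queries pooled together, ignoring session boundaries."""
    pooled = [query for queries in sessions.values() for query in queries]
    return oov_rate(pooled, vocabulary)


def share_of_sessions_at_or_below(sessions, vocabulary, threshold):
    """Fraction of sessions whose own OoV rate is at or below `threshold`."""
    rates = [oov_rate(queries, vocabulary) for queries in sessions.values()]
    if not rates:
        return 0.0
    return sum(rate <= threshold for rate in rates) / len(rates)


# Toy data, invented for this example only.
training_queries = ["weather in paris", "weather in rome", "pizza in rome"]
test_sessions = {
    "s1": ["weather in rome"],
    "s2": ["pizza in paris", "weather in paris"],
}

vocabulary = build_vocabulary(training_queries, size=3)
aggregate = aggregate_oov_rate(test_sessions, vocabulary)
share = share_of_sessions_at_or_below(test_sessions, vocabulary, threshold=0.01)
print(sorted(vocabulary), aggregate, share)
```

Sweeping `size` over a range and recomputing both numbers would trace out how the vocabulary size relates to the aggregate and per-session OoV rates, which is the kind of relationship the abstract reports.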

Taxonomy

Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques