A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR
Sunil Kumar Kopparapu

TL;DR
This paper introduces a calculus-based method to determine the optimal vocabulary size for end-to-end ASR systems, improving performance by formalizing the hyper-parameter selection process.
Contribution
It formalizes a calculus-based approach to estimate the optimal vocabulary size hyper-parameter for end-to-end ASR training.
Findings
Applying the method to the LibriSpeech dataset improves ASR performance.
The approach effectively estimates vocabulary size using curve fitting.
Training with the estimated optimal vocabulary size improves recognition accuracy.
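As a rough illustration of how curve fitting and calculus can combine to pick a vocabulary size, the sketch below fits a parabola to a cost measured at several candidate vocabulary sizes, then sets the derivative of the fitted curve to zero. It is not the paper's cost function, which is not given here. The quadratic-in-log2 model, the function names `fit_quadratic` and `optimal_vocab_size`, and the demo data are all assumptions made for illustration. The demo costs are generated from a formula, not measured.

```python
import math

def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    # Power sums for the 3x3 normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]                   # s[k] = sum of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]   # t[k] = sum of y*x^k
    # Augmented matrix for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    n = 3
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (m[i][n] - rest) / m[i][i]
    return tuple(sol)

def optimal_vocab_size(vocab_sizes, costs):
    """Hypothetical helper: minimise a parabola fitted in x = log2(vocab size)."""
    xs = [math.log2(v) for v in vocab_sizes]
    a, b, _ = fit_quadratic(xs, costs)
    # Second derivative is 2a; it must be positive for a minimum to exist.
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    # First derivative 2*a*x + b = 0 gives the stationary point.
    x_star = -b / (2 * a)
    return round(2 ** x_star)

# Synthetic demo data (assumption): costs come from a formula, not from ASR runs.
sizes = [64, 128, 256, 1024, 2048, 4096]
costs = [(math.log2(v) - 9) ** 2 + 5 for v in sizes]
print(optimal_vocab_size(sizes, costs))
```

In practice the costs would come from whatever quantity is being optimized at each candidate size, and a different functional form may fit better than a parabola.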
Abstract
In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) take the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPnet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to…
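To make concrete what it means for a tokenizer to take the vocabulary size as an input, here is a minimal toy BPE trainer. It is a simplified sketch, not the implementation used by ESPnet or by the paper. The function name `train_bpe`, the tie-breaking rule, and the example corpus are assumptions chosen for illustration.

```python
from collections import Counter

def train_bpe(corpus, vocab_size):
    """Toy BPE: merge the most frequent adjacent pair until the
    vocabulary reaches vocab_size or no pairs remain."""
    # Each word is a tuple of symbols, starting from single characters.
    words = Counter(tuple(w) for w in corpus.split())
    vocab = {ch for w in words for ch in w}
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for w, n in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += n
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        # Most frequent pair; ties broken by the pair itself for determinism.
        (a, b), _ = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merged = a + b
        vocab.add(merged)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_words = Counter()
        for w, n in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words[tuple(out)] += n
        words = new_words
    return vocab

corpus = "low low lower lowest newer newest"
for size in (8, 10, 12):
    print(size, sorted(train_bpe(corpus, size)))
```

The same corpus yields a different set of sub-word units for each requested size, which is why the size acts as a hyper-parameter of the downstream ASR training.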
