Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis
Karren Yang, Ting-Yao Hu, Jen-Hao Rick Chang, Hema Swetha Koppula, Oncel Tuzel

TL;DR
This paper investigates the effectiveness of using personalized synthetic speech data for adapting ASR models to individual speakers, revealing that content relevance is key and proposing a data selection strategy based on speech content.
Contribution
The study demonstrates when synthetic data improves ASR personalization and uncovers that content, not style, drives adaptation effectiveness, leading to a new data selection method.
Findings
Synthetic data enhances ASR personalization, especially for underrepresented speakers.
Global models with limited capacity benefit more from synthetic personalization.
Content relevance of synthetic speech is crucial for effective speaker adaptation.
Abstract
Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we ask two fundamental questions about this strategy: when is synthetic data effective for personalization, and why is it effective in those cases? To address the first question, we adapt a state-of-the-art automatic speech recognition (ASR) model to target speakers from four benchmark datasets representative of different speaker types. We show that ASR personalization with synthetic data is effective in all cases, but particularly when (i) the target speaker is underrepresented in the global data, and (ii) the capacity of the global model is limited. To address the second question of why personalized synthetic data is effective, we use…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and dialogue systems
