Which Data Matter? Embedding-Based Data Selection for Speech Recognition

Zakaria Aldeneh; Skyler Seto; Maureen de Seyssel; Jie Chi; Zijin Gu; Takuya Higuchi; Jee-weon Jung; Shinji Watanabe; David Grangier; Barry-John Theobald; Tatiana Likhomanenko

arXiv:2603.05819 · cs.SD · March 16, 2026

TL;DR

This paper explores embedding-based data selection for speech recognition, showing that models trained on carefully chosen small subsets can outperform models trained on the full dataset in specific target domains.

Contribution

It introduces a method for selecting relevant data subsets using embeddings that capture multiple speech characteristics, improving domain-specific ASR performance.

Findings

01. Training on a selected 5% subset of the data can outperform training on the full dataset, with a WER reduction of up to 36.8%.

02. Embedding-based selection effectively balances relevance and diversity for targeted domain adaptation.

03. Targeted data selection improves ASR performance while using significantly less training data.
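
For readers less familiar with the metric behind the first finding, the following is a minimal sketch of how word error rate (WER) and a relative WER reduction are commonly computed. This summary does not say whether the 36.8% figure is relative or absolute; the sketch assumes a relative reduction. The function names are illustrative, and the numbers in the usage lines are made up for the example and are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values, chosen only to illustrate the arithmetic.
print(word_error_rate("the cat sat", "the bat sat down"))
print(relative_reduction(20.0, 15.0))
```

Under this definition, a model whose WER drops from 20.0 to 15.0 has achieved a 25% relative reduction, even though the absolute difference is 5 points.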

Abstract

Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specialist models lack the capacity to learn from all available data, and one must pay closer attention to addressing the mismatch between training and test conditions. In this work, we study targeted data selection as a strategy to address these challenges, selecting relevant subsets from 100k hours of in-the-wild training data to optimize performance on target domains. We represent speech samples using embeddings that capture complementary characteristics--speaker attributes, phonetic content, and semantic meaning--and analyze how relevance and diversity along these axes when performing data selection affect…
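
To make the idea of trading off relevance against diversity concrete, here is a minimal sketch of one generic way to select a subset from precomputed embeddings. It is not the procedure from the paper, which this truncated abstract does not specify. It swaps in a greedy maximal-marginal-relevance style rule: relevance is cosine similarity to the centroid of the target-domain embeddings, and diversity is enforced by penalizing similarity to items already chosen. The function name `select_subset` and the parameter `diversity_weight` are assumptions made for illustration. The embeddings could come from any one of the axes the abstract names (speaker, phonetic, or semantic).

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def select_subset(pool_emb, target_emb, budget, diversity_weight=0.5):
    """Greedily pick `budget` pool indices (illustrative, not the paper's method).

    Each step picks the item maximizing
        (1 - diversity_weight) * relevance - diversity_weight * redundancy,
    where redundancy is the highest cosine similarity to any item picked so far.
    """
    pool = l2_normalize(np.asarray(pool_emb, dtype=float))
    target = l2_normalize(np.asarray(target_emb, dtype=float))

    # Relevance: cosine similarity to the target-domain centroid.
    centroid = target.mean(axis=0)
    centroid = centroid / max(np.linalg.norm(centroid), 1e-12)
    relevance = pool @ centroid

    n = len(pool)
    budget = min(budget, n)
    # Start at the minimum cosine value so the first pick is driven by relevance.
    redundancy = np.full(n, -1.0)
    chosen = np.zeros(n, dtype=bool)
    selected = []

    for _ in range(budget):
        score = (1.0 - diversity_weight) * relevance - diversity_weight * redundancy
        score[chosen] = -np.inf  # never pick the same item twice
        idx = int(np.argmax(score))
        selected.append(idx)
        chosen[idx] = True
        # Update each item's similarity to its nearest already-selected item.
        redundancy = np.maximum(redundancy, pool @ pool[idx])

    return selected


# Toy 2-D example: two items near the target direction, two far from it.
pool = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
target = np.array([[1.0, 0.0]])
print(select_subset(pool, target, budget=2, diversity_weight=0.0))
print(select_subset(pool, target, budget=2, diversity_weight=0.9))
```

With `diversity_weight=0.0` the rule reduces to picking the items most similar to the target. Raising the weight pushes later picks away from what has already been selected. The greedy loop recomputes similarities at every step, so a pool on the scale of 100k hours would need an approximate or batched variant.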

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research