Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

Pradeep Rangappa; Andres Carofilis; Jeena Prakash; Shashi Kumar; Sergio Burdisso; Srikanth Madikeri; Esau Villatoro-Tello; Bidisha Sharma; Petr Motlicek; Kadri Hacioglu; Shankar Venkatesan; Saurabh Vyas; Andreas Stolcke

arXiv:2506.03681·cs.CL·October 6, 2025

TL;DR

This paper presents a multi-stage filtering approach for selecting high-quality pseudo-labeled data to efficiently adapt ASR models to specific domains, reducing data requirements while maintaining performance.

Contribution

It introduces a robust data selection pipeline combining WER prediction, NER, and CER analysis for improved domain adaptation of ASR models using pseudo-labels.

Findings

01

Filtering reduces training data from 7500 hours to 100 hours with minimal WER increase.

02

Fine-tuning on the full 7500 hours of pseudo-labeled call center data achieves 12.3% WER; the filtered 100-hour subset is reported to perform similarly.

03

Similar results are observed on the Fisher English dataset.

Abstract

Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar…
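As a rough illustration of the CER-based part of such a pipeline, the sketch below keeps a segment only when two ASR hypotheses for it agree closely at the character level. It is not the authors' implementation: the abstract does not say how CER is computed or thresholded, so the function names, the two-hypothesis comparison, and the `max_cer=0.1` cutoff are all assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against ref (ref length as denominator)."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def select_segments(
    segments: Iterable[Tuple[str, str, str]],
    max_cer: float = 0.1,  # hypothetical threshold, not taken from the paper
) -> List[str]:
    """Keep the IDs of segments whose two hypotheses agree within max_cer.

    Each item is (segment_id, hypothesis_a, hypothesis_b), for example the
    outputs of two different ASR systems on the same audio segment.
    """
    return [seg_id for seg_id, hyp_a, hyp_b in segments
            if cer(hyp_a, hyp_b) <= max_cer]
```

In this sketch, agreement between two systems stands in for transcript quality: a pseudo-label is trusted only when an independent recognizer produces nearly the same text. The WER prediction and NER stages named in the abstract would be separate filters and are not modeled here.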
