Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes

Rogier C. van Dalen; Shucong Zhang; Titouan Parcollet; Sourav Bhattacharya

arXiv:2506.10653 · eess.AS · June 13, 2025

TL;DR

This paper introduces a robust unsupervised adaptation method for speech recognisers that combines entropy minimisation over multiple hypotheses with speaker codes, reporting a 20% relative WER reduction from one minute of adaptation data.

Contribution

The paper presents a novel loss function, the conditional entropy over complete hypotheses, and uses speaker codes so that unsupervised adaptation can be estimated from little data.

Findings

01. 20% relative WER reduction on 1 minute of data
02. 29% WER reduction on 10 minutes of data
03. Effective in noisy, far-field conditions

Abstract

Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation to a single minute of data robust. First, instead of estimating the adaptation parameters with cross-entropy on a single error-prone hypothesis or "pseudo-label", this paper proposes a novel loss function, the conditional entropy over complete hypotheses. Using multiple hypotheses makes adaptation more robust to errors in the initial recognition. Second, a "speaker code" characterises a speaker in a vector short enough that it requires little data to estimate. On a far-field noise-augmented version of Common Voice, the proposed scheme yields a 20% relative improvement in…
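
The idea of an entropy over complete hypotheses can be illustrated with a small sketch. It assumes, for illustration only, that the recogniser returns an N-best list with one log-score per complete hypothesis and that the posterior is approximated by renormalising those scores over the list; the paper's exact formulation may differ. The function name `nbest_entropy` and the example scores are hypothetical.

```python
import math


def nbest_entropy(log_scores):
    """Entropy (in nats) of the distribution over an N-best list.

    Assumption: `log_scores` holds one unnormalised log-score per
    complete hypothesis; they are renormalised over the list only.
    """
    if not log_scores:
        raise ValueError("need at least one hypothesis")
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(log_scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in log_scores))
    entropy = 0.0
    for score in log_scores:
        log_p = score - log_z  # normalised log-probability
        entropy -= math.exp(log_p) * log_p
    return entropy


# Hypothetical scores: one dominant hypothesis versus three equally likely ones.
confident = [-1.0, -9.0, -10.0]
uncertain = [-5.0, -5.0, -5.0]
print(nbest_entropy(confident) < nbest_entropy(uncertain))
```

In an adaptation loop, a quantity like this would be minimised with respect to the adaptation parameters (for example a speaker code), so that no single error-prone hypothesis has to be treated as the ground truth.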

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques