Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition

Yuan Tseng; Titouan Parcollet; Rogier van Dalen; Shucong Zhang; Sourav Bhattacharya

arXiv:2505.22251 · eess.AS · June 6, 2025

TL;DR

This paper finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appears in public LLM pretraining corpora. This test set contamination biases LLM outputs and calls into question the reliability of performance claims based on these two datasets.

Contribution

The paper demonstrates the impact of test set contamination on the evaluation of LLMs in speech recognition and emphasizes the need for careful separation of training and evaluation data in future assessments.

Findings

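01. A substantial amount of the LibriSpeech and Common Voice evaluation sets appears in public LLM pretraining corpora.

02. A contaminated LLM is more likely to generate test sentences it has seen during training.

03. Speech recognisers based on a contaminated LLM show only subtle error rate differences, but assign significantly higher probabilities to transcriptions seen during LLM training.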

Abstract

Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora. This calls into question the reliability of findings drawn from these two datasets. To measure contamination impact, LLMs trained with/without contamination are compared. A contaminated LLM is more likely to generate test sentences it has seen during training. Then, speech recognisers based on LLMs are compared. They show only subtle error rate differences if the LLM is contaminated, but assign significantly higher probabilities to transcriptions seen during LLM training. Results show that LLM outputs can be biased by tiny amounts of data…
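As an illustration of what "evaluation sentences appear in a pretraining corpus" can mean in practice, the following is a minimal sketch of a verbatim-match contamination check. It is not the authors' procedure: the normalisation rules and the function names `normalize` and `contamination_rate` are assumptions made for this example, and a real pretraining corpus would need an index rather than a linear scan.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    Assumption: this mimics typical ASR transcript normalisation;
    the paper may normalise differently.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def contamination_rate(test_sentences, corpus_documents):
    """Return (fraction of test sentences found verbatim, list of hits).

    A sentence counts as contaminated if its normalised form occurs,
    aligned on word boundaries, inside any normalised corpus document.
    """
    # Pad with spaces so matches start and end on word boundaries.
    corpus = [" " + normalize(doc) + " " for doc in corpus_documents]
    hits = []
    for sentence in test_sentences:
        norm = normalize(sentence)
        if not norm:
            continue  # skip sentences that are empty after normalisation
        needle = " " + norm + " "
        if any(needle in doc for doc in corpus):
            hits.append(sentence)
    rate = len(hits) / len(test_sentences) if test_sentences else 0.0
    return rate, hits
```

Calling `contamination_rate(transcripts, documents)` on a list of evaluation transcripts and a list of corpus documents gives the share of transcripts that occur verbatim. Exact matching of this kind misses paraphrased or partially overlapping text, so it yields a lower bound on overlap rather than a full contamination measure.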

Taxonomy

Topics: Speech Recognition and Synthesis · Authorship Attribution and Profiling · Topic Modeling