Speech Robust Bench: A Robustness Benchmark For Speech Recognition
Muhammad A. Shah, David Solans Noguero, Mikko A. Heikkila, Bhiksha Raj, and Nicolas Kourtellis

TL;DR
This paper introduces Speech Robust Bench (SRB), a comprehensive benchmark with 114 perturbations designed to evaluate and compare the robustness of speech recognition models across diverse real-world corruptions and demographic groups.
Contribution
The paper presents SRB, a new benchmark for evaluating ASR robustness, and provides analysis of model performance across different model architectures and demographic subgroups.
Findings
Larger model size and certain modeling choices, such as discrete representations and self-training, appear to improve robustness.
Robustness differs significantly across demographic groups.
SRB enables comprehensive evaluation of ASR models.
Abstract
As Automatic Speech Recognition (ASR) models become ever more pervasive, it is important to ensure that they make reliable predictions under corruptions present in the physical and digital world. We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. SRB is composed of 114 input perturbations which simulate a heterogeneous range of corruptions that ASR models may encounter when deployed in the wild. We use SRB to evaluate the robustness of several state-of-the-art ASR models and observe that model size and certain modeling choices, such as the use of discrete representations or self-training, appear to be conducive to robustness. We extend this analysis to measure the robustness of ASR models on data from various demographic subgroups, namely English and Spanish speakers, and males and females. Our results…
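To make the evaluation idea in the abstract concrete, the sketch below perturbs a clean signal and scores a transcript before and after. It is not SRB's implementation: white Gaussian noise at a fixed signal-to-noise ratio stands in for a single perturbation, word error rate (WER) is assumed as the scoring metric, and the function names `add_noise_at_snr`, `word_error_rate`, and `wer_degradation` are hypothetical. The transcripts are hand-written placeholders, not model outputs.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR (dB).

    Assumption: this is one simple illustrative perturbation, not an SRB one.
    """
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def wer_degradation(clean_wer, perturbed_wer):
    """How much WER increases when the input is perturbed."""
    return perturbed_wer - clean_wer


# One second of a 440 Hz tone at 16 kHz stands in for a clean utterance.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=rng)

# Placeholder transcripts; a real run would obtain these from an ASR model.
reference = "the cat sat on the mat"
clean_wer = word_error_rate(reference, "the cat sat on the mat")
noisy_wer = word_error_rate(reference, "the cat sit on mat")
degradation = wer_degradation(clean_wer, noisy_wer)
```

Reporting the change in WER rather than the raw perturbed WER separates a model's sensitivity to the perturbation from its baseline accuracy on clean audio; whether SRB reports exactly this quantity is not stated in the text above.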
Peer Reviews
Decision · ICLR 2025 Poster
### originality
This work combines multiple existing datasets to build a comprehensive analysis tool for ASR models. The authors place their work fairly into existing literature.
### quality
The writing is mostly error-free and clear to understand. The authors analyzed relevant, contemporary ASR models. Nice-to-have but not required models to include could've been WavLM and, perhaps, OWSM.
### clarity
The authors use a familiar section layout for their paper. I also appreciat
### Benchmark protocol is unclear
Section 3 is missing important details for reproducing this work. I think practitioners cannot replicate the benchmark protocol by reading the paper as-is. It is critical for the usefulness of a benchmark that it can be independently reproduced. I have the following remarks:
1. I find the usage of the word 'Perturbations' unclear. I see how adversarial attacks, environmental effects, and digital augmentations can be classified as perturbations of the audio sig
* The experiments are comprehensive. The paper considers a couple of state-of-the-art ASR systems and analyzes both adversarial and non-adversarial perturbations.
* It analyzes the correlation between different models in terms of robustness.
* It analyzes robustness for speakers from different subgroups (accent, gender).
Although the analyses are very comprehensive, the contributions/novelty are not clearly stated. According to the quoted claim above, it feels like the most significant contribution compared to prior work is having more noise sources? Please explain more about the contributions/novelty here. It would be helpful to have a table comparing the proposed approach with baselines on the diffs from the 4 modules.
1. The paper conducts extensive and detailed analysis to evaluate the robustness of recent ASR models. The Takeaways help develop robust ASR models.
1. While I appreciate the efforts to evaluate the robustness of recent ASR models, the paper offers limited novelty. I suggest that the authors compare their benchmarks to existing robustness metrics or frameworks in the ASR field.
2. To evaluate the robustness of ASR models, the current framework adds noise to clean speech data to evaluate the performance. How do recent ASR models perform in real-world noisy data? In other words, how can we evaluate the robustness of ASR models to real-world n
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
