Evaluating and Improving the Robustness of Speech Command Recognition Models to Noise and Distribution Shifts
Anaïs Baranger, Lucas Maison

TL;DR
This paper investigates how training conditions and input features influence the robustness of speech command recognition models under noise and distribution shifts, benchmarking various architectures and analyzing noise-aware training effects.
Contribution
It provides a comprehensive analysis of factors affecting robustness and generalization in speech models, highlighting the impact of noise-aware training and evaluation metrics.
Findings
Noise-aware training can improve robustness in certain configurations.
Evaluation metrics like Fairness and Robustness effectively quantify generalization.
Benchmarking reveals architecture-specific responses to noise and distribution shifts.
Abstract
Although prior work in computer vision has shown strong correlations between in-distribution (ID) and out-of-distribution (OOD) accuracies, such relationships remain underexplored in audio-based models. In this study, we investigate how training conditions and input features affect the robustness and generalization abilities of spoken keyword classifiers under OOD conditions. We benchmark several neural architectures across a variety of evaluation sets. To quantify the impact of noise on generalization, we make use of two metrics: Fairness (F), which measures overall accuracy gains compared to a baseline model, and Robustness (R), which assesses the convergence between ID and OOD performance. Our results suggest that noise-aware training improves robustness in some configurations. These findings shed new light on the benefits and limitations of noise-based augmentation for…
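The abstract does not say how noise is added during noise-aware training. A common approach to this kind of augmentation is to mix a noise recording into each clean utterance at a chosen signal-to-noise ratio (SNR). The sketch below illustrates only that general idea; the helper names `mix_at_snr` and `measured_snr_db`, the synthetic tone standing in for a keyword, and the 5 dB target are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise scaled so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # Average power of each signal (mean of squared samples).
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise signal is silent")
    # From SNR_dB = 10 * log10(P_speech / P_noise), solve for the noise power.
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_p_noise / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR of a mixture, recovering the added noise as mixture minus speech."""
    residual = [m - s for s, m in zip(speech, mixture)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_residual = sum(r * r for r in residual) / len(residual)
    return 10.0 * math.log10(p_speech / p_residual)


rng = random.Random(0)
# Assumption: one second of a 440 Hz tone at 16 kHz stands in for a spoken keyword.
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
print(round(measured_snr_db(speech, noisy), 2))
```

In a training pipeline, a function like this would typically be applied on the fly with a randomly sampled noise clip and SNR per utterance, so the model sees a range of noise conditions rather than a single fixed one.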
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
