Robustness of end-to-end Automatic Speech Recognition Models -- A Case Study using Mozilla DeepSpeech

Aashish Agarwal; Torsten Zesch

arXiv:2105.09742 · cs.CL · May 21, 2021 · 1 citation

TL;DR

This paper investigates the robustness of end-to-end automatic speech recognition models, using Mozilla DeepSpeech as a case study, and shows how dataset biases and overlap between training and test data can cause reported error rates to underestimate the true error rate.

Contribution

It provides a detailed analysis of how dataset selection bias, gender, and content overlap affect the perceived performance of speech recognition models.

Findings

01

Content overlap between training and test data has the biggest impact on error rates

02

Gender influences model performance

03

Dataset biases can cause reported error rates to underestimate true error rates

Abstract

When evaluating the performance of automatic speech recognition models, usually word error rate within a certain dataset is used. Special care must be taken in understanding the dataset in order to report realistic performance numbers. We argue that many performance numbers reported probably underestimate the expected error rate. We conduct experiments controlling for selection bias, gender as well as overlap (between training and test data) in content, voices, and recording conditions. We find that content overlap has the biggest impact, but other factors like gender also play a role.
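To make the measurement concrete, the following is a minimal sketch of how word error rate can be computed and reported separately for test sentences that also occur in the training data and for those that do not. It is not the authors' evaluation code. The function names, the exact-match definition of content overlap, and the toy sentences are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (reference transcript, model hypothesis)


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over word tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(pairs: List[Pair]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(word_edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("no reference words to score")
    return errors / words


def split_by_content_overlap(pairs: List[Pair], train_sentences: Set[str]):
    """Assumption: 'content overlap' means the reference sentence appears verbatim in training."""
    seen = [p for p in pairs if p[0] in train_sentences]
    unseen = [p for p in pairs if p[0] not in train_sentences]
    return seen, unseen


# Hypothetical toy data, not taken from the paper.
train_sentences = {"the cat sat"}
test_pairs = [("the cat sat", "the cat sat"), ("a dog ran home", "a dog run")]

seen, unseen = split_by_content_overlap(test_pairs, train_sentences)
print("overall WER:", wer(test_pairs))
print("WER on overlapping content:", wer(seen))
print("WER on unseen content:", wer(unseen))
```

In this toy case the pooled figure sits between the two subset figures, which illustrates why a single pooled number can look better than performance on genuinely unseen content when overlapping sentences are included.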

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing