How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications

Juan Zuluaga-Gomez; Amrutha Prasad; Iuliia Nigmatulina; Saeed Sarfjoo; Petr Motlicek; Matthias Kleinert; Hartmut Helmke; Oliver Ohneiser; Qingran Zhan

arXiv:2203.16822 · eess.AS · October 18, 2022 · 5 cites

TL;DR

This paper evaluates the robustness of Wav2Vec 2.0 and XLS-R models on air traffic control speech recognition under domain shift, showing significant improvements over traditional methods and analyzing factors like low-resource performance and gender bias.

Contribution

It provides an extensive benchmark of pre-trained speech models on a new domain, air traffic control, highlighting their robustness and limitations under domain shift conditions.

Findings

01

Wav2Vec 2.0 and XLS-R achieve relative WER reductions of 20% to 40% over hybrid-based ASR baselines.

02

These gains come from fine-tuning alone, with less labeled data than the hybrid baselines use.

03

Analysis of gender bias and low-resource scenarios reveals model strengths and weaknesses.
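
The relative WER reduction quoted in the findings compares two error rates rather than subtracting them. The sketch below shows one common way to compute word-level WER (edit distance divided by reference length) and the relative reduction between a baseline and a new system. The function names, the example transcript, and the numbers are illustrative assumptions and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical ATC-style utterance with one dropped word.
    wer = word_error_rate("climb flight level eight zero", "climb level eight zero")
    print(f"WER: {wer:.2f}")

    # Hypothetical WERs, chosen only to illustrate the arithmetic.
    print(f"Relative reduction: {relative_wer_reduction(0.30, 0.21):.0%}")
```

Under this definition, a drop from a hypothetical 30% WER to 21% WER is a 30% relative reduction, even though the absolute difference is only 9 percentage points.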

Abstract

Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic models (AM) that can later be fine-tuned on downstream tasks, e.g., automatic speech recognition (ASR). Yet few works have investigated the impact on performance when the data properties differ substantially between the pre-training and fine-tuning phases, termed domain shift. We target this scenario by analyzing the robustness of Wav2Vec 2.0 and XLS-R models on downstream ASR for a completely unseen domain, air traffic control (ATC) communications. We benchmark these two models on several open-source and challenging ATC databases with signal-to-noise ratios between 5 and 20 dB. Relative word error rate (WER) reductions between 20% and 40% are obtained in comparison to hybrid-based ASR baselines by only fine-tuning E2E acoustic models with a smaller…



Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing