How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications
Juan Zuluaga-Gomez, Amrutha Prasad, Iuliia Nigmatulina, Saeed Sarfjoo, Petr Motlicek, Matthias Kleinert, Hartmut Helmke, Oliver Ohneiser, Qingran Zhan

TL;DR
This paper evaluates the robustness of Wav2Vec 2.0 and XLS-R models on air traffic control (ATC) speech recognition under domain shift, reporting relative WER reductions of 20% to 40% over hybrid-based ASR baselines and analyzing factors such as low-resource performance and gender bias.
Contribution
It provides an extensive benchmark of pre-trained speech models on a new domain, air traffic control, highlighting their robustness and limitations under domain shift conditions.
Findings
Wav2Vec 2.0 and XLS-R achieve relative WER reductions of 20% to 40% over hybrid-based ASR baselines.
These gains come from fine-tuning alone, using less labeled data.
Analysis of gender bias and low-resource scenarios reveals model strengths and weaknesses.
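The headline numbers above are relative WER reductions. As a minimal sketch of how such a figure can be computed (not the authors' evaluation code), the Python below derives WER from a word-level edit distance and then the relative reduction between two systems. The function names, the example transcripts, and the WER values 0.10 and 0.07 are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical ATC-style utterance: the hypothesis drops the word "flight".
reference = "lufthansa four five six descend flight level eight zero"
hypothesis = "lufthansa four five six descend level eight zero"
wer = word_error_rate(reference, hypothesis)

# Hypothetical system-level WERs, not taken from the paper.
reduction = relative_wer_reduction(baseline_wer=0.10, new_wer=0.07)
```

Under these made-up values the reduction is about 30%, which would fall inside the 20% to 40% range reported. Note that a relative reduction is measured against the baseline WER, so it is not the same as an absolute drop in WER points.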
Abstract
Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic models (AM) that can later be fine-tuned on downstream tasks, e.g., automatic speech recognition (ASR). Yet few works have investigated the impact on performance when the data properties differ substantially between the pre-training and fine-tuning phases, termed domain shift. We target this scenario by analyzing the robustness of Wav2Vec 2.0 and XLS-R models on downstream ASR for a completely unseen domain, air traffic control (ATC) communications. We benchmark these two models on several open-source and challenging ATC databases with signal-to-noise ratios between 5 and 20 dB. Relative word error rate (WER) reductions between 20% and 40% are obtained in comparison to hybrid-based ASR baselines by only fine-tuning E2E acoustic models with a smaller…
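For readers unfamiliar with the 5 to 20 dB figure, the sketch below shows the standard decibel definition of signal-to-noise ratio, SNR = 10 · log10(P_signal / P_noise). It is an illustration only; the source does not say how SNR was estimated for the ATC databases, and the function name and power values are hypothetical.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from average signal and noise power."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical power ratios: 100:1 corresponds to the upper end of the range,
# and roughly 3.16:1 to the lower end.
high = snr_db(100.0, 1.0)
low = snr_db(10 ** 0.5, 1.0)
```

At 5 dB the speech carries only about three times the power of the noise, which gives a sense of why these recordings are described as challenging.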
Code & Models
- 🤗 Jzuluaga/wav2vec2-xls-r-300m-en-atc-atcosim (model) · 79 dl · ♡ 4
- 🤗 Jzuluaga/wav2vec2-large-960h-lv60-self-en-atc-atcosim (model) · 78 dl · ♡ 6
- 🤗 Jzuluaga/wav2vec2-xls-r-300m-en-atc-uwb-atcc (model) · 4 dl · ♡ 2
- 🤗 Jzuluaga/wav2vec2-xls-r-300m-en-atc-uwb-atcc-and-atcosim (model) · 20 dl · ♡ 8
- 🤗 Jzuluaga/wav2vec2-large-960h-lv60-self-en-atc-uwb-atcc (model) · 34 dl · ♡ 3
- 🤗 Jzuluaga/wav2vec2-large-960h-lv60-self-en-atc-uwb-atcc-and-atcosim (model) · 2.1k dl · ♡ 4
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
