Revealing the Role of Audio Channels in ASR Performance Degradation
Kuan-Tang Huang, Li-Wei Chen, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang

TL;DR
This paper investigates how different audio recording channels affect ASR performance and introduces a normalization method that aligns the model's internal feature representations with those of a clean reference channel, improving robustness across channels and languages.
Contribution
The study identifies channel variation itself, rather than only the mismatch between training and testing corpora, as a key factor in ASR degradation. It proposes a normalization technique that mitigates this effect and improves robustness on unseen channels and languages.
Findings
Normalization improves ASR accuracy on unseen channels
Method generalizes well across different languages
Significant performance gains over baseline models
Abstract
Pre-trained automatic speech recognition (ASR) models have demonstrated strong performance on a variety of tasks. However, their performance can degrade substantially when the input audio comes from different recording channels. Previous studies have demonstrated this phenomenon but often attribute it to the mismatch between training and testing corpora. This study argues that variations in speech characteristics caused by different recording channels can fundamentally harm ASR performance. To address this limitation, we propose a normalization technique designed to mitigate the impact of channel variation by aligning internal feature representations in the ASR model with those derived from a clean reference channel. This approach significantly improves ASR performance on previously unseen channels and languages, highlighting its ability to generalize across channel and language…
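The idea of aligning features with a clean reference channel can be illustrated with a small sketch. The abstract does not specify the form of the normalization, so the code below is not the authors' method: it assumes simple per-dimension mean and variance matching, applied to plain lists of feature frames rather than to the internal layers of an ASR model. The function names `channel_stats` and `align_to_reference` are hypothetical.

```python
from statistics import fmean, pstdev


def channel_stats(frames):
    """Return per-dimension (means, stds) over a list of feature frames.

    Each frame is a list of floats; all frames share the same length.
    """
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]
    return means, stds


def align_to_reference(frames, ref_means, ref_stds, eps=1e-8):
    """Shift and scale each dimension so its statistics match the reference.

    Assumption: mean/variance matching stands in for the paper's
    normalization, whose exact form is not given in the abstract.
    """
    src_means, src_stds = channel_stats(frames)
    aligned = []
    for frame in frames:
        aligned.append([
            (x - sm) / (ss + eps) * rs + rm
            for x, sm, ss, rm, rs in zip(
                frame, src_means, src_stds, ref_means, ref_stds
            )
        ])
    return aligned


# Toy data: the "noisy" channel is an affine distortion of the clean one.
clean = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
noisy = [[10.0, -2.0], [14.0, 6.0], [18.0, 14.0]]

ref_means, ref_stds = channel_stats(clean)
restored = align_to_reference(noisy, ref_means, ref_stds)
```

In this toy case the distortion is purely affine per dimension, so matching the first two moments recovers the clean features almost exactly. Real channel effects are unlikely to be this simple, which is presumably why the paper operates on internal model representations.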
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
