FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition
Dongning Yang, Wei Wang, Yanmin Qian

TL;DR
FAT-HuBERT is a distortion-invariant self-supervised learning approach that uses layer-wise fusion modules to make speech recognition more robust to the distortions introduced by speech enhancement (SE) frontends, and it reports significant word error rate (WER) reductions.
Contribution
The paper presents FAT-HuBERT, a framework that combines distortion-invariant self-supervised learning (SSL) with layer-wise feature fusion to make automatic speech recognition (ASR) more robust to the distortions introduced by SE frontends.
Findings
Significant WER reduction on simulated noisy speech generated from LibriSpeech and on real-world noisy speech from CHiME-4.
Layer-wise fusion improves robustness to speech distortions.
Random selection of SE frontends during training enhances generalization.
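The random-frontend finding can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the three toy "frontends" are hypothetical stand-ins for real SE models, and `make_training_pair` is a hypothetical helper, not a function from the paper. The sketch only shows the sampling idea: each training example is paired with the output of one frontend drawn at random from a pool.

```python
import random

# Hypothetical stand-ins for SE frontends; real ones would be trained enhancement models.
def identity_frontend(wave):
    return list(wave)

def attenuate_frontend(wave):
    return [0.5 * x for x in wave]

def clip_frontend(wave):
    return [max(-0.3, min(0.3, x)) for x in wave]

FRONTEND_POOL = [identity_frontend, attenuate_frontend, clip_frontend]

def make_training_pair(noisy_wave, rng):
    """Draw one frontend at random and return (noisy, enhanced, frontend name)."""
    frontend = rng.choice(FRONTEND_POOL)
    return noisy_wave, frontend(noisy_wave), frontend.__name__

rng = random.Random(0)  # seeded only so the sketch is repeatable
noisy = [0.1, -0.6, 0.4]
for step in range(3):
    _, enhanced, name = make_training_pair(noisy, rng)
    print(step, name, enhanced)
```

Because the downstream model sees a different distortion pattern from step to step, it cannot overfit to the artifacts of a single frontend, which is one plausible reading of why the finding reports better generalization.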
Abstract
Advancements in monaural speech enhancement (SE) techniques have greatly improved the perceptual quality of speech. However, integrating these techniques into automatic speech recognition (ASR) systems has not yielded the expected performance gains, primarily due to the introduction of distortions during the SE process. In this paper, we propose a novel approach called FAT-HuBERT, which leverages distortion-invariant self-supervised learning (SSL) to enhance the robustness of ASR. To address the distortions introduced by the SE frontends, we introduce layer-wise fusion modules that incorporate features extracted from both observed noisy signals and enhanced signals. During training, the SE frontend is randomly selected from a pool of models. We evaluate the performance of FAT-HuBERT on simulated noisy speech generated from LibriSpeech as well as real-world noisy speech from the CHiME-4…
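The abstract does not say how the layer-wise fusion modules combine the two feature streams, so the following is only a sketch under a stated assumption: each layer blends the noisy-signal features and the enhanced-signal features with a single learnable scalar gate. The gate form, the function names `fuse_layer` and `fuse_all_layers`, and the toy feature values are all hypothetical and are not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_layer(h_noisy, h_enhanced, gate_logit):
    """Blend two same-shaped feature sequences (frames x dims) with one scalar gate.

    Assumed form: fused = g * h_noisy + (1 - g) * h_enhanced, with g = sigmoid(gate_logit).
    """
    g = sigmoid(gate_logit)
    return [
        [g * n + (1.0 - g) * e for n, e in zip(frame_n, frame_e)]
        for frame_n, frame_e in zip(h_noisy, h_enhanced)
    ]

def fuse_all_layers(noisy_layers, enhanced_layers, gate_logits):
    """Apply one fusion gate per layer, mirroring the 'layer-wise' wording."""
    return [
        fuse_layer(n, e, a)
        for n, e, a in zip(noisy_layers, enhanced_layers, gate_logits)
    ]

# Toy features: 2 layers, each 2 frames x 2 dims.
noisy_layers = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
enhanced_layers = [[[3.0, 2.0], [1.0, 0.0]], [[2.0, 1.0], [1.0, 2.0]]]
fused = fuse_all_layers(noisy_layers, enhanced_layers, gate_logits=[0.0, 2.0])
print(fused[0])  # a gate logit of 0.0 gives an equal-weight average for layer 0
```

Keeping a separate gate per layer lets each layer lean toward whichever stream is more useful at that depth; a real implementation would more likely use learned projections or attention, and the paper should be consulted for the actual design.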
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research
