HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance Regularization

Hyebin Ahn; Kangwook Jang; Hoirin Kim

arXiv:2508.12292·cs.SD·August 19, 2025

TL;DR

HuBERT-VIC adds variance-invariance-covariance regularization (VICReg) objectives to HuBERT to improve noise robustness, yielding relative improvements of 23.3% on LibriSpeech test-clean and 13.2% on test-other over a baseline pre-trained on noisy speech.

Contribution

This paper presents HuBERT-VIC, a noise-robust speech foundation model trained with variance, invariance, and covariance regularization (VICReg) objectives that adjust the statistics of noisy speech representations.

Findings

01

23.3% relative improvement on LibriSpeech test-clean over the baseline model pre-trained on noisy speech

02

13.2% relative improvement on LibriSpeech test-other over the same baseline

03

Enhanced generalization across different noise types
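
As a reading aid for the figures above: a relative improvement is usually computed as the reduction in error divided by the baseline error. The sketch below assumes the metric is word error rate (WER), which this page does not state explicitly. The function name and the WER values are made up for illustration and are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative error reduction in percent (assumes lower is better)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical numbers, NOT from the paper: a WER drop from 10.0 to 7.67
# corresponds to roughly a 23.3% relative improvement.
example = relative_improvement(10.0, 7.67)
```

Because the measure is relative, the same percentage can correspond to very different absolute changes depending on how high the baseline error is.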

Abstract

Noise robustness in speech foundation models (SFMs) has been a critical challenge, as most models are primarily trained on clean data and suffer performance degradation when exposed to noisy speech. To address this issue, we propose HuBERT-VIC, a noise-robust SFM with variance, invariance, and covariance regularization (VICReg) objectives. These objectives adjust the statistics of noisy speech representations, enabling the model to capture diverse acoustic characteristics and improving its ability to generalize across different types of noise. When applied to HuBERT, our model shows relative performance improvements of 23.3% on LibriSpeech test-clean and 13.2% on test-other, compared to the baseline model pre-trained on noisy speech.
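
To make the three objectives concrete, here is a minimal NumPy sketch of the generic VICReg loss as it is commonly formulated. It is not the authors' implementation. The function name, the loss weights (25, 25, 1), the target standard deviation `gamma`, and the idea of feeding in two batches of paired representations (for example, features of two views of the same utterances) are all assumptions; this page does not say how HuBERT-VIC pairs inputs or weights the terms.

```python
import numpy as np


def vicreg_loss(z_a, z_b, inv_w=25.0, var_w=25.0, cov_w=1.0,
                gamma=1.0, eps=1e-4):
    """Generic VICReg loss for two (N, D) batches of paired representations.

    All default weights are assumed values, not taken from the paper.
    """
    n, d = z_a.shape

    # Invariance: mean squared error between the paired representations.
    inv = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation: penalizes
        # dimensions whose spread across the batch falls below gamma.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(z):
        # Sum of squared off-diagonal covariance entries, scaled by D:
        # discourages different dimensions from carrying redundant information.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    var = variance_term(z_a) + variance_term(z_b)
    cov = covariance_term(z_a) + covariance_term(z_b)
    total = inv_w * inv + var_w * var + cov_w * cov
    return total, {"invariance": inv, "variance": var, "covariance": cov}


# Usage with random stand-in features (not real speech representations).
rng = np.random.default_rng(0)
clean_like = rng.standard_normal((256, 8))
noisy_like = clean_like + 0.1 * rng.standard_normal((256, 8))
total, parts = vicreg_loss(clean_like, noisy_like)
```

In this formulation the invariance term pulls paired representations together, while the variance and covariance terms act on batch statistics, which is one way to read the abstract's statement that the objectives "adjust the statistics" of the representations.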

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing