Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models
Jakob Poncelet, Yujun Wang, Hugo Van hamme

TL;DR
This paper introduces a parameter-efficient method to extract noise-robust discrete speech units from self-supervised models, improving performance in noisy conditions and enabling adaptation with minimal data.
Contribution
A small encoder-decoder model with optional adapters is proposed to denoise and discretize SSL speech features, enhancing noise robustness and adaptability.
Findings
Outperforms several pre-training methods on noisy discretization
Also outperforms them on noisy speech recognition
Can be fine-tuned with limited unlabeled data
Abstract
Continuous speech can be converted into a discrete sequence by deriving discrete units from the hidden features of self-supervised learned (SSL) speech models. Although SSL models are becoming larger and trained on more data, they are often sensitive to real-life distortions like additive noise or reverberation, which translates to a shift in discrete units. We propose a parameter-efficient approach to generate noise-robust discrete units from pre-trained SSL models by training a small encoder-decoder model, with or without adapters, to simultaneously denoise and discretise the hidden features of the SSL model. The model learns to generate a clean discrete sequence for a noisy utterance, conditioned on the SSL features. The proposed denoiser outperforms several pre-training methods on the tasks of noisy discretisation and noisy speech recognition, and can be finetuned to the target…
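The "shift in discrete units" mentioned in the abstract can be illustrated with a small sketch. It assumes units are obtained by assigning each feature frame to its nearest centroid in a fixed codebook, which is a common choice but is not stated in the abstract. The toy 2-D features, the four-entry codebook, and the helper names `quantize` and `unit_change_rate` are all hypothetical and stand in for real SSL features and the paper's actual pipeline.

```python
import random


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid
    (squared Euclidean distance). Assumed discretisation scheme."""
    units = []
    for frame in frames:
        dists = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(dists.index(min(dists)))
    return units


def unit_change_rate(clean_units, noisy_units):
    """Fraction of frames whose unit differs between the clean and
    noisy versions of the same utterance."""
    if len(clean_units) != len(noisy_units):
        raise ValueError("sequences must have the same length")
    changed = sum(a != b for a, b in zip(clean_units, noisy_units))
    return changed / len(clean_units)


# Toy stand-ins for SSL features: 200 frames of 2-D vectors near a codebook entry.
rng = random.Random(0)
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
clean = [
    [rng.choice([0.0, 1.0]) + rng.gauss(0, 0.05),
     rng.choice([0.0, 1.0]) + rng.gauss(0, 0.05)]
    for _ in range(200)
]
# Simulated distortion: additive Gaussian perturbation in feature space.
noisy = [[x + rng.gauss(0, 0.5) for x in frame] for frame in clean]

clean_units = quantize(clean, centroids)
noisy_units = quantize(noisy, centroids)
shift = unit_change_rate(clean_units, noisy_units)
print(f"fraction of frames whose unit changed: {shift:.2f}")
```

A noise-robust discretiser, in this framing, would aim to drive that fraction toward zero by predicting the clean unit sequence from the noisy features.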
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
