Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models

Jakob Poncelet; Yujun Wang; Hugo Van hamme

arXiv:2409.02565 · eess.AS · February 6, 2025

Open Access

TL;DR

This paper introduces a parameter-efficient method to extract noise-robust discrete speech units from self-supervised models, improving performance in noisy conditions and enabling adaptation with minimal data.

Contribution

A small encoder-decoder model with optional adapters is proposed to denoise and discretize SSL speech features, enhancing noise robustness and adaptability.

Findings

1. Outperforms several pre-training methods on noisy discretization
2. Also outperforms them on noisy speech recognition
3. Can be fine-tuned with limited unlabeled data

Abstract

Continuous speech can be converted into a discrete sequence by deriving discrete units from the hidden features of self-supervised learned (SSL) speech models. Although SSL models are becoming larger and trained on more data, they are often sensitive to real-life distortions like additive noise or reverberation, which translates to a shift in discrete units. We propose a parameter-efficient approach to generate noise-robust discrete units from pre-trained SSL models by training a small encoder-decoder model, with or without adapters, to simultaneously denoise and discretise the hidden features of the SSL model. The model learns to generate a clean discrete sequence for a noisy utterance, conditioned on the SSL features. The proposed denoiser outperforms several pre-training methods on the tasks of noisy discretisation and noisy speech recognition, and can be finetuned to the target…
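
The unit shift the abstract describes can be illustrated with a toy sketch. This is not the authors' model or pipeline: it assumes discrete units are obtained by nearest-centroid assignment against a codebook (a common choice, for example k-means centroids), and it uses random vectors as stand-ins for SSL hidden features and Gaussian perturbation as a stand-in for acoustic distortion. The names `quantize` and `unit_change_rate` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def quantize(features, codebook):
    """Map each frame (row) to the index of its nearest codebook centroid."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def unit_change_rate(units_a, units_b):
    """Fraction of frames whose discrete unit differs between two sequences."""
    return float(np.mean(units_a != units_b))


rng = np.random.default_rng(0)
num_frames, dim, num_units = 200, 16, 50

# Assumption: random stand-ins for real SSL features and a learned codebook.
codebook = rng.normal(size=(num_units, dim))
clean_features = rng.normal(size=(num_frames, dim))

# Assumption: a toy distortion applied directly in feature space.
noisy_features = clean_features + 0.5 * rng.normal(size=(num_frames, dim))

clean_units = quantize(clean_features, codebook)
noisy_units = quantize(noisy_features, codebook)

# A non-zero rate means the same utterance maps to a different unit sequence.
print("unit change rate:", unit_change_rate(clean_units, noisy_units))
```

In this framing, a noise-robust discretiser is one that drives the change rate between the clean and noisy unit sequences toward zero; the paper's approach trains a small model to predict the clean sequence from the noisy utterance's SSL features rather than quantising those features directly.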

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis