An Adapter based Multi-label Pre-training for Speech Separation and Enhancement
Tianrui Wang, Xie Chen, Zhuo Chen, Shu Yu, Weibin Zhu

TL;DR
This paper introduces an adapter-based multi-label pre-training method for HuBERT that improves speech separation, enhancement, and recognition performance by adding separation and denoising terms to the pre-training objective, with only a small increase in parameters.
Contribution
The paper proposes an adapter-based architecture for HuBERT's Transformer encoder, in which only a few parameters of each layer are adjusted, so that ASR performance is maintained while performance on speech separation and enhancement improves significantly.
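To make the adapter idea concrete, here is a minimal sketch of a generic residual bottleneck adapter in NumPy. It is an illustration under stated assumptions, not the authors' design: the class name `BottleneckAdapter`, the ReLU nonlinearity, the zero-initialized up-projection, and the bottleneck width of 64 are all hypothetical choices. The hidden size of 768 matches HuBERT Base.

```python
import numpy as np


class BottleneckAdapter:
    """Generic residual bottleneck adapter (hypothetical; not the paper's exact module)."""

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection to a narrow bottleneck, small random init.
        self.w_down = rng.normal(0.0, 0.02, size=(d_model, d_bottleneck))
        self.b_down = np.zeros(d_bottleneck)
        # Up-projection is zero-initialized so the adapter starts as an identity map
        # and the frozen backbone's behavior is unchanged at the start of training.
        self.w_up = np.zeros((d_bottleneck, d_model))
        self.b_up = np.zeros(d_model)

    def __call__(self, h):
        # h: (num_frames, d_model) hidden states from a Transformer layer.
        z = np.maximum(h @ self.w_down + self.b_down, 0.0)  # ReLU
        return h + z @ self.w_up + self.b_up  # residual connection

    def num_params(self):
        return sum(p.size for p in (self.w_down, self.b_down, self.w_up, self.b_up))


d_model, d_bottleneck = 768, 64  # 64 is an assumed bottleneck width
adapter = BottleneckAdapter(d_model, d_bottleneck)
frames = np.random.default_rng(1).normal(size=(10, d_model))
out = adapter(frames)
adapter_params = adapter.num_params()
```

With these assumed sizes the adapter holds 99,136 parameters per layer, which is small next to the millions of weights in a full Transformer layer of that width. This is the kind of trade-off the "few parameters of each layer" claim refers to.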
Findings
Improved performance on speech separation and enhancement tasks.
Maintained or enhanced ASR accuracy.
Faster pre-training with minimal parameter increase.
Abstract
In recent years, self-supervised learning (SSL) has achieved tremendous success in various speech tasks due to its power to extract representations from massive unlabeled data. However, compared with tasks such as speech recognition (ASR), the improvements from SSL representation in speech separation (SS) and enhancement (SE) are considerably smaller. Based on HuBERT, this work investigates improving the SSL model for SS and SE. We first update HuBERT's masked speech prediction (MSP) objective by integrating the separation and denoising terms, resulting in a multiple pseudo label pre-training scheme, which significantly improves HuBERT's performance on SS and SE but degrades the performance on ASR. To maintain its performance gain on ASR, we further propose an adapter-based architecture for HuBERT's Transformer encoder, where only a few parameters of each layer are adjusted to the…
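The multiple pseudo-label objective mentioned in the abstract can be sketched roughly as follows. This is only one plausible reading: the abstract does not give the exact loss, so summing one masked-frame cross-entropy term per pseudo-label stream with equal weights is an assumption, as are the function names and the toy shapes.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def multi_label_masked_loss(logits_per_stream, labels_per_stream, mask):
    """Sum of masked-frame cross-entropies, one term per pseudo-label stream.

    logits_per_stream: list of float arrays, each of shape (T, K)
    labels_per_stream: list of int arrays, each of shape (T,)
    mask: bool array of shape (T,), True where the input frame was masked
    Equal weighting of the streams is an assumption of this sketch.
    """
    total = 0.0
    for logits, labels in zip(logits_per_stream, labels_per_stream):
        logp = log_softmax(logits)
        frame_nll = -logp[np.arange(len(labels)), labels]
        total += frame_nll[mask].mean()  # average over masked frames only
    return total


# Toy example: T frames, K pseudo-label classes, two label streams
# (e.g. one per source in a mixture; a hypothetical setup).
T, K = 6, 4
mask = np.array([True, True, False, False, True, False])
logits = [np.zeros((T, K)), np.zeros((T, K))]
labels = [np.array([0, 1, 2, 3, 0, 1]), np.array([3, 2, 1, 0, 3, 2])]
loss = multi_label_masked_loss(logits, labels, mask)
```

With all-zero logits each stream predicts a uniform distribution, so every term equals log K and the toy loss is 2 log 4. A real system would take the logits from the encoder's prediction heads and the labels from clustering of reference signals.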
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Softmax · Adam · Absolute Position Encodings · Byte Pair Encoding
