An Adapter based Multi-label Pre-training for Speech Separation and Enhancement

Tianrui Wang; Xie Chen; Zhuo Chen; Shu Yu; Weibin Zhu

arXiv:2211.06041 · eess.AS · November 14, 2022 · ICASSP

Open Access

TL;DR

This paper introduces an adapter-based, multi-label pre-training method for HuBERT. It adds separation and denoising terms to the masked speech prediction objective to improve speech separation and enhancement, and uses adapters that adjust only a few parameters per Transformer layer to preserve speech recognition performance.

Contribution

It proposes an adapter-based architecture for HuBERT's Transformer encoder that maintains ASR performance while significantly improving performance on speech separation and enhancement.

Findings

01 Improved performance on speech separation and enhancement tasks.

02 Maintained or enhanced ASR accuracy.

03 Faster pre-training with minimal parameter increase.

Abstract

In recent years, self-supervised learning (SSL) has achieved tremendous success in various speech tasks due to its power to extract representations from massive unlabeled data. However, compared with tasks such as speech recognition (ASR), the improvements from SSL representation in speech separation (SS) and enhancement (SE) are considerably smaller. Based on HuBERT, this work investigates improving the SSL model for SS and SE. We first update HuBERT's masked speech prediction (MSP) objective by integrating the separation and denoising terms, resulting in a multiple pseudo label pre-training scheme, which significantly improves HuBERT's performance on SS and SE but degrades the performance on ASR. To maintain its performance gain on ASR, we further propose an adapter-based architecture for HuBERT's Transformer encoder, where only a few parameters of each layer are adjusted to the…
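
The abstract is cut off before it describes the adapter design, so the following is only a minimal sketch of a generic residual bottleneck adapter, not the paper's architecture. The class name `BottleneckAdapter`, the helper functions, the ReLU nonlinearity, the zero-initialized up-projection, and the sizes used are all illustrative assumptions. It shows how a small trainable module can sit on top of a frozen layer output while adding few parameters.

```python
import random


def matvec(weights, x, bias):
    """Multiply a row-major matrix by a vector and add a bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class BottleneckAdapter:
    """Hypothetical residual adapter: h + W_up * relu(W_down * h + b_down) + b_up.

    This is a generic design assumed for illustration; the paper's actual
    adapter may differ.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: dim -> bottleneck, small random initialization.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        # Up-projection: bottleneck -> dim, zero-initialized so the adapter
        # starts as an identity mapping over the frozen layer's output.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]
        self.b_up = [0.0] * dim

    def __call__(self, h):
        z = relu(matvec(self.w_down, h, self.b_down))
        delta = matvec(self.w_up, z, self.b_up)
        return [a + d for a, d in zip(h, delta)]

    def num_params(self):
        dim, bottleneck = len(self.w_up), len(self.w_down)
        # Two weight matrices plus two bias vectors.
        return 2 * dim * bottleneck + dim + bottleneck


adapter = BottleneckAdapter(dim=8, bottleneck=2)
hidden = [0.5] * 8               # stand-in for one frame of a frozen layer's output
adapted = adapter(hidden)        # equals `hidden` until the up-projection is trained
adapter_params = adapter.num_params()
```

Under these assumptions, only the adapter's weights would be updated during pre-training with the new objective, while the original layer weights stay fixed. That matches the abstract's description of adjusting only a few parameters of each layer.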

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Softmax · Adam · Absolute Position Encodings · Byte Pair Encoding