SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech

Jingru Lin; Meng Ge; Junyi Ao; Liqun Deng; Haizhou Li

arXiv:2407.02826 · eess.AS · July 4, 2024 · Interspeech

TL;DR

SA-WavLM is a self-supervised pre-trained model for mixture speech. It extracts a representation for each speaker in the mixture, merges them, and then predicts, improving performance on multi-speaker tasks.

Contribution

The paper proposes an "extract-merge-predict" pre-training pipeline for mixture speech, with speaker-informed extraction and a speaker shuffling strategy.

Findings

01  SA-WavLM matches or surpasses state-of-the-art models.

02  Speaker shuffling enhances robustness to speaker absence.

03  The model improves multi-speaker speech processing performance.

Abstract

Pre-trained models with self-supervised learning (SSL) techniques have been shown to be effective in various downstream speech tasks. However, most such models are trained on single-speaker speech data, which limits their effectiveness on mixture speech. This motivates us to explore pre-training on mixture speech. This work presents SA-WavLM, a novel pre-trained model for mixture speech. Specifically, SA-WavLM follows an "extract-merge-predict" pipeline in which the representations of each speaker in the input mixture are first extracted individually and then merged before the final prediction. In this pipeline, SA-WavLM performs speaker-informed extractions that take into account the interactions between different speakers. Furthermore, a speaker shuffling strategy is proposed to enhance robustness to speaker absence. Experiments show that SA-WavLM either matches or…
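
The data flow of the "extract-merge-predict" pipeline can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function names, the gating used for extraction, the summation used for merging, and the argmax used for prediction are hypothetical stand-ins, not the layers or objectives of SA-WavLM. The speaker shuffling strategy is left out because the abstract does not describe how it works.

```python
from typing import List

Frame = List[float]  # one feature vector per time step (toy representation)


def extract(mixture: List[Frame], speaker: Frame) -> List[Frame]:
    # Hypothetical speaker-informed extraction: gate every mixture frame
    # element-wise by the speaker embedding. Not the paper's extractor.
    return [[x * g for x, g in zip(frame, speaker)] for frame in mixture]


def merge(streams: List[List[Frame]]) -> List[Frame]:
    # Hypothetical merge: sum the per-speaker streams frame by frame.
    n_frames = len(streams[0])
    dim = len(streams[0][0])
    return [
        [sum(stream[t][d] for stream in streams) for d in range(dim)]
        for t in range(n_frames)
    ]


def predict(merged: List[Frame]) -> List[int]:
    # Hypothetical prediction head: index of the largest feature per frame,
    # standing in for a discrete target prediction.
    return [max(range(len(frame)), key=frame.__getitem__) for frame in merged]


def extract_merge_predict(mixture: List[Frame], speakers: List[Frame]) -> List[int]:
    # Extract one stream per speaker, merge them, then predict once.
    streams = [extract(mixture, speaker) for speaker in speakers]
    return predict(merge(streams))


# Two frames of a 3-dimensional "mixture" and two toy speaker embeddings.
mixture = [[1.0, 2.0, 0.5], [0.2, 0.1, 3.0]]
speakers = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
labels = extract_merge_predict(mixture, speakers)
```

The sketch only shows the ordering of the three stages: each speaker gets its own extracted stream, and a single prediction is made after the streams are merged.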

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing