Self-supervised learning with diffusion-based multichannel speech enhancement for speaker verification under noisy conditions

Sandipana Dowerah; Ajinkya Kulkarni; Romain Serizel (MULTISPEECH); Denis Jouvet

arXiv:2307.02244·cs.SD·July 6, 2023

Open Access

TL;DR

This paper proposes Diff-Filter, a diffusion-based multichannel speech enhancement method combined with self-supervised training to improve speaker verification accuracy in noisy, reverberant environments.

Contribution

It introduces a novel diffusion probabilistic model for speech enhancement and a two-step self-supervised training procedure for speaker verification.

Findings

01. Significant performance improvements on the MultiSV dataset.

02. Effective noise and reverberation suppression in multichannel conditions.

03. Demonstrates the benefit of self-supervised learning for speaker verification.

Abstract

The paper introduces Diff-Filter, a multichannel speech enhancement approach based on the diffusion probabilistic model, for improving speaker verification performance under noisy and reverberant conditions. It also presents a new two-step training procedure that takes advantage of self-supervised learning. In the first stage, the Diff-Filter is trained by conducting time-domain speech filtering using a scoring-based diffusion model. In the second stage, the Diff-Filter is jointly optimized with a pre-trained ECAPA-TDNN speaker verification model under a self-supervised learning framework. We present a novel loss based on equal error rate. This loss is used to conduct self-supervised learning on a dataset that is not labelled in terms of speakers. The proposed approach is evaluated on MultiSV, a multichannel speaker verification dataset, and shows significant improvements in performance…
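
The abstract refers to a loss based on equal error rate (EER) without giving its form. As background only, the following is a minimal sketch of how the standard EER metric is typically computed from verification trial scores. It is not the paper's loss (which must be differentiable and is not described here); the function name `equal_error_rate` and the toy scores are assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) for a set of verification trial scores.

    Illustrative metric only; NOT the EER-based training loss from the paper.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    # Candidate thresholds: every observed score, in ascending order.
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection rate: same-speaker trials scoring below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: different-speaker trials at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # The EER is the operating point where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Toy scores (hypothetical): higher means "more likely the same speaker".
eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(f"EER = {eer:.2f} at threshold {thr:.2f}")
```

Because this computation relies on hard thresholding, it cannot be used directly for gradient-based training; a loss "based on" EER would need some differentiable surrogate, whose exact form is not stated in the abstract.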

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing