UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures

Zhong-Qiu Wang; Shinji Watanabe

arXiv:2305.20054 · cs.SD · October 31, 2023 · 2 citations


TL;DR

UNSSOR is an unsupervised neural speech separation method that leverages over-determined training mixtures and mixture constraints to separate speakers in reverberant environments without labeled data.

Contribution

This paper introduces UNSSOR, a novel unsupervised neural speech separation algorithm that uses over-determined mixtures and a mixture sum constraint for training.
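The mixture-sum constraint can be written compactly. The notation below is ours and is chosen for illustration only: $Y_p(t,f)$ is the mixture STFT at microphone $p$, $X_p^{(c)}$ is the image of speaker $c$ at that microphone, $\hat{Z}^{(c)}$ is the DNN's intermediate estimate for speaker $c$, and $g_{p,k}^{(c)}(f)$ are per-frequency linear filter taps. The causal $K$-tap filter and the squared-error loss are assumptions, not the paper's exact formulation.

```latex
\begin{aligned}
Y_p(t,f) &= \sum_{c=1}^{C} X_p^{(c)}(t,f), \qquad p = 1,\dots,P,\quad P > C \\
\hat{Y}_p(t,f) &= \sum_{c=1}^{C} \sum_{k=0}^{K-1} g_{p,k}^{(c)}(f)\, \hat{Z}^{(c)}(t-k,f) \\
\mathcal{L} &= \sum_{p=1}^{P} \sum_{t,f} \bigl| Y_p(t,f) - \hat{Y}_p(t,f) \bigr|^2
\end{aligned}
```

Because each of the $P > C$ microphones contributes its own constraint, the system has more equations than unknown speaker signals, which is what narrows down the admissible solutions.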

Findings

01 Effective separation in reverberant conditions

02 Although training requires over-determined mixtures, the trained DNN can perform under-determined separation (e.g., monaural separation)

03 Shows promising results on reverberant two-speaker separation

Abstract

In reverberant conditions with multiple concurrent speakers, each microphone acquires a mixture signal of multiple speakers at a different location. In over-determined conditions where the microphones outnumber the speakers, we can narrow down the solutions to speaker images and realize unsupervised speech separation by leveraging each mixture signal as a constraint (i.e., the estimated speaker images at a microphone should add up to the mixture). Equipped with this insight, we propose UNSSOR, an algorithm for unsupervised neural speech separation by leveraging over-determined training mixtures. At each training step, we feed an input mixture to a deep neural network (DNN) to produce an intermediate estimate for each speaker, linearly filter the estimates, and optimize a loss so that, at each microphone, the filtered…
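The training step described above (filter each speaker's intermediate estimate, sum, and compare with the mixture at every microphone) can be sketched as a loss function. This is a minimal NumPy sketch under stated assumptions: the function name `mixture_constraint_loss`, the array layout, the causal tap count, and the joint least-squares filter fit are all hypothetical simplifications. The paper's own filter estimation and its additional loss terms are not reproduced here, and this version only evaluates the loss value (no gradients).

```python
import numpy as np


def mixture_constraint_loss(mix, est, taps=3):
    """Hypothetical mixture-constraint loss (illustrative sketch only).

    mix: complex array (P, T, F), multi-microphone mixture STFT.
    est: complex array (C, T, F), DNN intermediate estimates, one per speaker.
    taps: number of causal filter taps per speaker (assumed, requires taps <= T).
    """
    P, T, F = mix.shape
    C = est.shape[0]
    total = 0.0
    for f in range(F):
        # Regressor matrix: each column is one speaker's estimate delayed by k frames.
        cols = []
        for c in range(C):
            for k in range(taps):
                col = np.zeros(T, dtype=complex)
                col[k:] = est[c, : T - k, f]
                cols.append(col)
        A = np.stack(cols, axis=1)  # shape (T, C * taps)
        for p in range(P):
            y = mix[p, :, f]
            # Simplification: fit all speakers' filters jointly by least squares,
            # so A @ g is the sum of the filtered estimates at microphone p.
            g, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ g
            # Penalize the mismatch between the mixture and the filtered sum.
            total += np.sum(np.abs(resid) ** 2)
    return total / (P * T * F)
```

If the mixture at each microphone is exactly a filtered sum of the estimates, the loss is numerically zero; unexplained energy in the mixture raises it. Using squared error keeps the filter fit a closed-form least-squares problem per microphone and frequency.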


Videos

UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures · SlidesLive

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Advanced Adaptive Filtering Techniques