UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures
Zhong-Qiu Wang, Shinji Watanabe

TL;DR
UNSSOR is an unsupervised neural speech separation method that leverages over-determined training mixtures and mixture constraints to separate speakers in reverberant environments without labeled data.
Contribution
This paper introduces UNSSOR, a novel unsupervised neural speech separation algorithm that uses over-determined mixtures and a mixture sum constraint for training.
Findings
Effective separation in reverberant conditions
Requires over-determined training mixtures, but the trained DNN can perform under-determined separation (e.g., monaural separation)
Shows promising results in two-speaker scenarios
Abstract
In reverberant conditions with multiple concurrent speakers, each microphone acquires a mixture signal of multiple speakers at a different location. In over-determined conditions where the microphones out-number speakers, we can narrow down the solutions to speaker images and realize unsupervised speech separation by leveraging each mixture signal as a constraint (i.e., the estimated speaker images at a microphone should add up to the mixture). Equipped with this insight, we propose UNSSOR, an algorithm for unsupervised neural speech separation by leveraging over-determined training mixtures. At each training step, we feed an input mixture to a deep neural network (DNN) to produce an intermediate estimate for each speaker, linearly filter the estimates, and optimize a loss so that, at each microphone, the filtered…
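The mixture constraint described in the abstract can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the paper's implementation: the function name mixture_constraint_loss is hypothetical, the "linear filter" is simplified to a single complex gain per speaker, microphone, and frequency, and the gains are fitted jointly by least squares purely for illustration. It shows only that the loss is small when the filtered estimates add up to the mixture at every microphone.

```python
import numpy as np

def mixture_constraint_loss(est, mix):
    """Toy mixture-constraint loss (hypothetical helper, not from the paper).

    est: complex array (C, T, F), intermediate estimate for each of C speakers.
    mix: complex array (P, T, F), mixture observed at each of P microphones.
    Assumption: each filter is a one-tap complex gain per frequency, fitted by
    least squares so that the filtered estimates best add up to the mixture.
    """
    C, T, F = est.shape
    P = mix.shape[0]
    total = 0.0
    for p in range(P):
        for f in range(F):
            A = est[:, :, f].T          # (T, C): one column per speaker estimate
            y = mix[p, :, f]            # (T,): mixture at microphone p
            g, *_ = np.linalg.lstsq(A, y, rcond=None)  # per-speaker gains
            total += np.abs(y - A @ g).sum()           # constraint violation
    return total / (P * T * F)
```

In an actual training loop, a loss of this kind would be back-propagated through the DNN that produced the estimates, which would call for an autodiff framework rather than plain NumPy.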
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Advanced Adaptive Filtering Techniques
