Deep Bayesian Unsupervised Source Separation Based on a Complex Gaussian Mixture Model
Yoshiaki Bando, Yoko Sasaki, Kazuyoshi Yoshii

TL;DR
This paper introduces an unsupervised neural source separation method that uses a complex Gaussian mixture model (cGMM) to jointly train separation and localization networks from multichannel mixture signals alone, without supervised training data.
Contribution
It proposes a deep Bayesian framework that jointly trains separation and localization networks using a complex Gaussian mixture model, addressing frequency permutation ambiguity.
Findings
Outperforms conventional initialization methods on simulated speech mixtures
Effectively estimates spatial variables without supervised training data
Enhances multichannel source separation performance
Abstract
This paper presents an unsupervised method that trains neural source separation by using only multichannel mixture signals. Conventional neural separation methods require large amounts of supervised data to achieve excellent performance. Although multichannel methods based on spatial information can work without such training data, they are often sensitive to parameter initialization and degrade when the sources are located close to each other. The proposed method uses a cost function based on a spatial model called a complex Gaussian mixture model (cGMM). This model has the time-frequency (TF) masks and directions of arrival (DoAs) of sources as latent variables and is used for training separation and localization networks that respectively estimate these variables. This joint training solves the frequency permutation ambiguity of the spatial model in a unified deep Bayesian framework. In…
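To make the idea of a cGMM-based cost function concrete, the following is a minimal sketch, not the paper's actual objective. It evaluates the negative log-likelihood of a multichannel STFT under a zero-mean complex Gaussian mixture, with TF masks acting as mixture weights. The function name, the array shapes, the per-bin power `lam`, and the per-source spatial covariance matrices `H` are all illustrative assumptions. The DoA latent variables and any priors described in the abstract are omitted here.

```python
import numpy as np

def cgmm_neg_log_likelihood(X, masks, H, lam):
    """Negative log-likelihood of a zero-mean complex Gaussian mixture (sketch).

    Assumed shapes (hypothetical, chosen for illustration):
      X     : (F, T, M) complex multichannel STFT of the mixture
      masks : (F, T, K) TF masks, non-negative and summing to 1 over K
      H     : (F, K, M, M) Hermitian positive-definite spatial covariances
      lam   : (F, T, K) positive per-bin source powers
    """
    F, T, M = X.shape
    H_inv = np.linalg.inv(H)                # (F, K, M, M)
    _, logdet_H = np.linalg.slogdet(H)      # (F, K), real for Hermitian PD H
    # Quadratic form x^H H^{-1} x for every (f, t, k).
    quad = np.einsum("ftm,fkmn,ftn->ftk", X.conj(), H_inv, X).real
    # Log density of N_C(x | 0, lam * H) for every (f, t, k).
    log_gauss = (-M * np.log(np.pi)
                 - M * np.log(lam)
                 - logdet_H[:, None, :]
                 - quad / lam)
    log_joint = np.log(masks + 1e-12) + log_gauss
    # Numerically stable log-sum-exp over the K sources.
    mx = log_joint.max(axis=-1, keepdims=True)
    log_marginal = mx[..., 0] + np.log(np.exp(log_joint - mx).sum(axis=-1))
    return -log_marginal.sum()
```

In a training setup along the lines the abstract describes, a separation network would supply `masks` and a localization network would inform the spatial parameters. In this sketch they are simply passed in as arrays.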
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
