TL;DR
This paper introduces a multi-channel speech separation method that combines a variational autoencoder (VAE) model of speech with spatial clustering. It outperforms a previous factorial GMM-based model (DOLPHIN) and makes adaptation to new noise conditions easier.
Contribution
It presents a new factorial model based on a generative neural network (VAE) that integrates spectral and spatial information for improved speech separation.
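To make "integrates spectral and spatial information" concrete, here is a minimal sketch of one generic way two such cues can be fused for a single time-frequency bin. It assumes the spectral and spatial observations are conditionally independent given the source, so their log-likelihoods add. The function name `combine_posteriors` and the numeric values are hypothetical illustrations; this is not the paper's inference procedure, which relies on a VAE speech model.

```python
import math

def combine_posteriors(spectral_loglik, spatial_loglik):
    """Fuse two cues into per-source posteriors for one time-frequency bin.

    Both arguments are lists of log-likelihoods, one entry per source.
    Assumption (not from the paper): the cues are conditionally independent
    given the source, and the prior over sources is uniform.
    """
    # Independence means the joint log-likelihood is the sum of the two.
    joint = [a + b for a, b in zip(spectral_loglik, spatial_loglik)]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(joint)
    weights = [math.exp(v - peak) for v in joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log-likelihoods for two speakers and one noise source.
mask = combine_posteriors([-1.0, -3.0, -5.0], [-0.5, -2.0, -0.5])
```

The resulting values sum to one and can be read as a soft assignment of the bin to each source. In a structured model of this kind, swapping the noise model would change only the entry for the noise source.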
Findings
Outperforms previous factorial GMM models (DOLPHIN)
Performs comparably to permutation invariant training with spatial clustering
Eases adaptation to new noise conditions
Abstract
In this paper, we propose a method that combines a variational autoencoder model of speech with a spatial clustering approach for multi-channel speech separation. The advantage of integrating spatial clustering with a spectral model has been shown in several works. As the spectral model, previous works used either factorial generative models of the mixed speech or discriminative neural networks. In our work, we combine the strengths of both approaches by building a factorial model based on a generative neural network, a variational autoencoder. By doing so, we can exploit the modeling power of neural networks while keeping a structured model. Such a model can be advantageous when adapting to new noise conditions, as only the noise part of the model needs to be modified. We show experimentally that our model significantly outperforms the previous factorial model based on Gaussian mixture…
