Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

Scott Wisdom; Aren Jansen; Ron J. Weiss; Hakan Erdogan; John R. Hershey

arXiv:2106.00847 · eess.AS · October 19, 2021

TL;DR

This paper introduces new loss functions and an efficient approximation of the mixture invariant training (MixIT) loss for unsupervised sound separation. Together they address over-separation and the exponential computational cost of MixIT, improving separation performance on in-the-wild data.

Contribution

The paper proposes sparsity and covariance losses that reduce over-separation, along with an efficient approximation of the MixIT loss that makes large numbers of output sources feasible.

Findings

01 Achieved over 13 dB SI-SNR improvement on the FUSS test set.

02 Boosted single-source SI-SNR by over 17 dB.

03 Demonstrated the effectiveness of the proposed losses and efficient MixIT on real-world data.
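
The findings above are reported in SI-SNR (scale-invariant signal-to-noise ratio). As a minimal sketch of how such a number is typically computed, the NumPy function below projects the estimate onto the reference and compares the energy of the projection with the energy of the residual. The function name `si_snr`, the `eps` stabilizer, and the omission of mean removal are assumptions made for illustration; the paper's exact evaluation code is not shown on this page.

```python
import numpy as np


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals (illustrative sketch)."""
    # Optimal scaling of the reference toward the estimate (least squares).
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    # Ratio of target energy to residual energy, in decibels.
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNR of the estimate minus SI-SNR of the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```

Because the reference is rescaled before the comparison, multiplying the estimate by any nonzero constant leaves the score essentially unchanged, which is the property that makes the metric scale-invariant.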

Abstract

Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. First, it produces models which tend to over-separate, producing more output sources than are present in the input. Second, the exponential computational complexity of the MixIT loss limits the number of feasible output sources. In this paper we address both issues. To combat over-separation we introduce new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs. We also experiment with a semantic classification loss by predicting weak…
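
To illustrate why the abstract calls the MixIT loss exponential, here is a minimal brute-force sketch in NumPy. It assumes two input mixtures and M separated outputs, tries every way of assigning each output to one of the two mixtures, and keeps the assignment with the lowest reconstruction error. The function name `mixit_loss` is hypothetical, and mean squared error is a stand-in for the signal-level loss, which this page does not specify. The sketch does not include the paper's sparsity, covariance, or classification losses, nor its efficient approximation.

```python
import itertools

import numpy as np


def mixit_loss(mixtures, sources):
    """Brute-force MixIT-style loss (illustrative sketch).

    mixtures: array of shape (2, T), the reference mixtures.
    sources:  array of shape (M, T), the separated model outputs.
    Returns (best_loss, best_assignment), where best_assignment[m] is the
    index of the mixture that output m is assigned to.
    """
    num_mix, num_src = mixtures.shape[0], sources.shape[0]
    best_loss, best_assign = np.inf, None
    # num_mix ** num_src candidate assignments: exponential in the output count.
    for assign in itertools.product(range(num_mix), repeat=num_src):
        # Binary mixing matrix with exactly one 1 per column.
        mixing = np.zeros((num_mix, num_src))
        mixing[list(assign), np.arange(num_src)] = 1.0
        remixed = mixing @ sources
        # Assumption: MSE replaces the unspecified signal-level loss.
        loss = np.mean((mixtures - remixed) ** 2)
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign
```

With two mixtures the loop visits 2^M assignments, so each additional output source doubles the work; this is the cost that limits the number of feasible output sources.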

Taxonomy

Topics: Speech and Audio Processing · Acoustic Wave Phenomena Research · Music and Audio Processing