Towards Listening to 10 People Simultaneously: An Efficient Permutation Invariant Training of Audio Source Separation Using Sinkhorn's Algorithm

Hideyuki Tachibana

arXiv:2010.11871·cs.SD·May 18, 2021

TL;DR

This paper introduces SinkPIT, a new permutation invariant training method using Sinkhorn's algorithm, enabling efficient training of neural networks to separate many audio sources simultaneously, demonstrated with 10 sources.

Contribution

The paper proposes SinkPIT, a scalable permutation invariant training loss based on Sinkhorn's algorithm, which makes training feasible for more sources than the $N = 2$ or $3$ that ordinary PIT handles in practice.

Findings

01

Successfully trained a neural network to separate 10 sources.

02

SinkPIT avoids evaluating all $N!$ permutations that ordinary PIT requires, which makes it much more efficient when $N$ is large.

03

Promising results in multi-source audio separation with SinkPIT.
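
To make the efficiency finding concrete, the following sketch compares the number of permutations ordinary PIT has to score with the number of entries in an $N \times N$ pairwise cost matrix, which is the object a Sinkhorn-style method works on. The function names are hypothetical and chosen for illustration; the comparison counts candidates and matrix entries only and is not a measurement of runtime.

```python
import math


def pit_permutation_count(n: int) -> int:
    """Number of ground-truth/estimate assignments ordinary PIT must try."""
    return math.factorial(n)


def pairwise_cost_entries(n: int) -> int:
    """Number of entries in the N x N pairwise cost matrix (illustrative)."""
    return n * n


if __name__ == "__main__":
    for n in (2, 3, 10):
        print(n, pit_permutation_count(n), pairwise_cost_entries(n))
```

For $N = 10$ this gives $10! = 3628800$ permutations against 100 matrix entries, which is why exhaustive PIT stops being practical well before ten sources.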

Abstract

In neural network-based monaural speech separation, it has recently become common to evaluate the loss using the permutation invariant training (PIT) loss. However, ordinary PIT requires trying all $N!$ permutations between the $N$ ground truths and the $N$ estimates. Since the factorial complexity explodes very rapidly as $N$ increases, PIT-based training works only when the number of source signals is small, such as $N = 2$ or $3$. To overcome this limitation, this paper proposes SinkPIT, a novel variant of the PIT loss that is much more efficient than the ordinary PIT loss when $N$ is large. SinkPIT is based on Sinkhorn's matrix balancing algorithm, which efficiently finds a doubly stochastic matrix that approximates the best permutation in a differentiable manner. The author conducted an experiment to train a neural network model to decompose a single-channel…
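
The idea described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the cost matrix `cost[i, j]` is assumed to hold a pairwise loss between ground truth `i` and estimate `j`, the inverse temperature `beta` and the iteration count `n_iters` are hypothetical parameters, and the soft loss below is a simplified form that may differ from the exact SinkPIT objective. A brute-force PIT loss is included only for comparison on small $N$.

```python
from itertools import permutations

import numpy as np


def _logsumexp(a: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log(sum(exp(a))) along one axis, keeping dims."""
    m = np.max(a, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))


def sinkhorn(cost: np.ndarray, beta: float = 10.0, n_iters: int = 50) -> np.ndarray:
    """Approximate a doubly stochastic matrix from an N x N cost matrix.

    Works in the log domain: start from -beta * cost, then alternately
    normalise rows and columns (Sinkhorn's matrix balancing).
    """
    log_p = -beta * cost
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)  # each row sums to 1
        log_p = log_p - _logsumexp(log_p, axis=0)  # each column sums to 1
    return np.exp(log_p)


def soft_pit_loss(cost: np.ndarray, beta: float = 10.0, n_iters: int = 50) -> float:
    """Cost weighted by the soft assignment (simplified, assumed form)."""
    p = sinkhorn(cost, beta, n_iters)
    return float(np.sum(p * cost))


def brute_force_pit_loss(cost: np.ndarray) -> float:
    """Ordinary PIT: minimum total cost over all N! permutations."""
    n = cost.shape[0]
    return float(min(
        sum(cost[i, perm[i]] for i in range(n))
        for perm in permutations(range(n))
    ))


if __name__ == "__main__":
    # Toy 3 x 3 cost matrix with one clearly best assignment.
    cost = np.array([[5.0, 0.1, 4.0],
                     [0.2, 6.0, 5.0],
                     [4.0, 5.0, 0.3]])
    print(sinkhorn(cost).round(3))
    print(soft_pit_loss(cost), brute_force_pit_loss(cost))
```

A larger `beta` pushes the balanced matrix closer to a hard permutation, while a smaller one keeps it smoother. Because every step is an elementwise operation or a normalisation, the same computation written in an automatic differentiation framework remains differentiable, which is the property the abstract highlights.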

Taxonomy

Methods: Convolutional time-domain audio separation network