Speeding Up Permutation Invariant Training for Source Separation
Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

TL;DR
This paper introduces an efficient decomposition of the permutation invariant training (PIT) criterion for source separation. It reduces the cost of the assignment search from exponential to polynomial in the number of utterances, which makes PIT practical for large speaker counts and long recordings.
Contribution
The paper decomposes the PIT criterion into the computation of a matrix followed by a strictly monotonically increasing function. The assignment problem can then be solved on the matrix alone: with the Hungarian algorithm for uPIT, and with newly introduced algorithms for Graph-PIT.
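As a rough illustration of the decomposition for the uPIT case, the criterion can be written as below. The notation is an assumption made for this sketch and is not taken from the paper: $K$ is the number of utterances, $\Pi_K$ the set of permutations of $\{1, \dots, K\}$, $M_{k,l}$ a pairwise score between estimate $\hat{s}_k$ and target $s_l$ (for example a squared error), and $f$ the strictly monotonically increasing function.

```latex
\mathcal{J}^{\mathrm{uPIT}}
  = \min_{\pi \in \Pi_K} f\!\left( \sum_{k=1}^{K} M_{k,\pi(k)} \right)
  = f\!\left( \min_{\pi \in \Pi_K} \sum_{k=1}^{K} M_{k,\pi(k)} \right),
\qquad
M_{k,l} = \ell(\hat{s}_k, s_l)
```

The second equality holds because $f$ is increasing, so the permutation that minimizes the inner sum also minimizes $f$ of that sum. Only the $K^2$ entries of $M$ depend on the signals. The remaining minimization is a linear sum assignment problem, which the Hungarian algorithm solves in $O(K^3)$ time instead of scanning all $K!$ permutations.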
Findings
Complexity reduced from exponential to polynomial in the number of utterances
Efficient algorithms enable large-scale source separation
Improved feasibility for long recordings and many speakers
Abstract
Permutation invariant training (PIT) is a widely used training criterion for neural network-based source separation, used for both utterance-level separation with utterance-level PIT (uPIT) and separation of long recordings with the recently proposed Graph-PIT. When implemented naively, both suffer from an exponential complexity in the number of utterances to separate, rendering them unusable for large numbers of speakers or long realistic recordings. We present a decomposition of the PIT criterion into the computation of a matrix and a strictly monotonically increasing function so that the permutation or assignment problem can be solved efficiently with several search algorithms. The Hungarian algorithm can be used for uPIT and we introduce various algorithms for the Graph-PIT assignment problem to reduce the complexity to be polynomial in the number of utterances.
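The uPIT case described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation. It assumes a mean squared error loss, so the monotonic function is just a scaling by 1/K, and it represents signals as lists of floats. All function names (`pairwise_mse_matrix`, `brute_force_assignment`, `hungarian_assignment`, `upit_loss`) are hypothetical. The solver is a textbook O(K^3) formulation of the Hungarian algorithm, and the brute-force search is included only as a reference.

```python
import itertools
import random


def pairwise_mse_matrix(estimates, targets):
    """M[i][j] = mean squared error between estimate i and target j (K*K terms)."""
    return [
        [sum((e - t) ** 2 for e, t in zip(est, tgt)) / len(est) for tgt in targets]
        for est in estimates
    ]


def brute_force_assignment(cost):
    """Reference solver: scan all K! permutations of the precomputed matrix."""
    k = len(cost)
    return list(min(
        itertools.permutations(range(k)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(k)),
    ))


def hungarian_assignment(cost):
    """Textbook O(K^3) Hungarian algorithm; row i is assigned column perm[i]."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (n + 1)   # column potentials (1-based)
    p = [0] * (n + 1)     # p[j] = row currently matched to column j (0 = free)
    way = [0] * (n + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:    # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    perm = [0] * n
    for j in range(1, n + 1):
        perm[p[j] - 1] = j - 1
    return perm


def upit_loss(estimates, targets, solver=hungarian_assignment):
    """uPIT with MSE: build the matrix once, then solve the assignment on it."""
    cost = pairwise_mse_matrix(estimates, targets)
    perm = solver(cost)
    return sum(cost[i][perm[i]] for i in range(len(perm))) / len(perm), perm


# Toy data: 5 random "sources"; the estimates are a shuffled, slightly noisy copy.
rng = random.Random(0)
targets = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(5)]
true_perm = [3, 0, 4, 1, 2]
estimates = [[x + rng.gauss(0, 0.01) for x in targets[j]] for j in true_perm]

loss, perm = upit_loss(estimates, targets)
print(perm, loss)
```

Both solvers operate on the same K x K matrix, so the loss is evaluated K^2 times regardless of which one is used; only the search over assignments differs. In practice a library routine such as SciPy's `scipy.optimize.linear_sum_assignment` solves the same assignment problem and could replace the hand-written solver.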
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: utterance-level permutation invariant training
