On permutation invariant training for speech source separation
Xiaoyu Liu, Jordi Pons

TL;DR
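This paper extends two permutation invariant training (PIT) strategies for speech source separation. It adapts the two-stage approach of frame-level PIT (tPIT) plus clustering from the STFT domain to waveforms and a learned latent space, proposes an efficient clustering loss that scales to waveform models, and extends an auxiliary speaker-ID loss with a deep feature loss to reduce the local permutation errors of utterance-level PIT (uPIT).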
Contribution
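It extends two PIT strategies: tPIT with clustering is adapted to waveform and latent spaces and paired with a clustering loss that scales to waveform models, and an auxiliary speaker-ID loss is extended with a deep feature loss based on "problem agnostic speech features" to reduce uPIT's local permutation errors.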
Findings
The proposed extensions help reduce permutation ambiguity in speech separation.
The studied STFT-based models are more effective at reducing permutation errors than the studied waveform models.
The proposed methods reduce permutation errors, but the results also expose limitations of waveform models.
Abstract
We study permutation invariant training (PIT), which targets the permutation ambiguity problem for speaker-independent source separation models. We extend two state-of-the-art PIT strategies. First, we look at the two-stage speaker separation and tracking algorithm based on frame-level PIT (tPIT) and clustering, which was originally proposed for the STFT domain, and we adapt it to work with waveforms and over a learned latent space. Further, we propose an efficient clustering loss scalable to waveform models. Second, we extend a recently proposed auxiliary speaker-ID loss with a deep feature loss based on "problem agnostic speech features", to reduce the local permutation errors made by the utterance-level PIT (uPIT). Our results show that the proposed extensions help reduce permutation ambiguity. However, we also note that the studied STFT-based models are more effective at…
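To make the difference between utterance-level and frame-level PIT concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names `upit_mse` and `tpit_mse` are hypothetical, mean squared error stands in for whatever separation loss the authors use, and the clustering and tracking stage that follows tPIT is not shown.

```python
import itertools
import numpy as np


def upit_mse(estimates, targets):
    """Utterance-level PIT: one permutation is chosen for the whole utterance.

    estimates, targets: arrays of shape (num_speakers, num_frames, num_features).
    Returns (minimum MSE, best permutation), where best_perm[i] is the index
    of the target assigned to estimate i.
    MSE is an assumption made for illustration only.
    """
    num_speakers = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_speakers)):
        loss = np.mean((estimates - targets[list(perm)]) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def tpit_mse(estimates, targets):
    """Frame-level PIT: the best permutation is chosen independently per frame.

    Returns (mean of the per-frame minimum MSE, list of per-frame permutations).
    """
    num_speakers = estimates.shape[0]
    perms = list(itertools.permutations(range(num_speakers)))
    # per_frame[p, t] is the error of permutation p at frame t.
    per_frame = np.stack([
        np.mean((estimates - targets[list(p)]) ** 2, axis=(0, 2))
        for p in perms
    ])
    best_idx = per_frame.argmin(axis=0)
    loss = per_frame.min(axis=0).mean()
    return loss, [perms[i] for i in best_idx]
```

Because tPIT takes the minimum per frame, its loss is never larger than the uPIT loss on the same inputs, but the output-to-speaker assignment can change from frame to frame. That is why the two-stage approach needs a second step to track speakers over time, while uPIT fixes one assignment per utterance and can still produce local permutation errors.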
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
