Separating Long-Form Speech with Group-Wise Permutation Invariant Training
Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei

TL;DR
This paper introduces Group-PIT, a novel training scheme for long-form speech separation that improves efficiency and leverages long-span relationships, enhancing multi-talker conversation processing.
Contribution
The paper proposes Group-PIT, a new training method enabling effective long-form speech separation with reduced computational cost and improved handling of long-span utterance relationships.
Findings
Group-PIT improves separation performance on long speech inputs.
The approach reduces computational costs compared to traditional methods.
The method is effective in simulated meeting-style scenarios.
Abstract
Multi-talker conversational speech processing has drawn much interest for various applications such as meeting transcription. Speech separation is often required to handle the overlapped speech that is commonly observed in conversation. Although the original utterance-level permutation invariant training-based continuous speech separation approach has proven to be effective in various conditions, it lacks the ability to leverage the long-span relationship of utterances and is computationally inefficient due to the highly overlapped sliding windows. To overcome these drawbacks, we propose a novel training scheme named Group-PIT, which allows direct training of the speech separation models on the long-form speech with a low computational cost for label assignment. Two different speech separation approaches with Group-PIT are explored, including direct long-span speech separation and…
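For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of conventional utterance-level permutation invariant training (PIT) label assignment, not of Group-PIT itself. It assumes a mean-squared-error criterion and plain Python lists as signals; the names `mse` and `pit_loss` are hypothetical and do not come from the paper. The loss is evaluated for every assignment of model outputs to reference sources and the smallest value is kept, which is why the search cost grows quickly with the number of sources.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals (illustrative criterion)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, references):
    """Utterance-level PIT: try every output-to-reference assignment, keep the best.

    Returns (min_loss, best_perm), where best_perm[i] is the index of the
    reference assigned to estimate i.
    """
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average the per-source loss under this candidate assignment.
        loss = sum(mse(estimates[i], references[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the model emits the two sources in swapped order.
refs = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
ests = [[-0.9, -1.1, -1.0], [1.0, 0.9, 1.1]]
loss, perm = pit_loss(ests, refs)
print(perm, loss)
```

In this toy case the selected assignment pairs estimate 0 with reference 1 and estimate 1 with reference 0, so the swapped output order is not penalized.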
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
