PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings
Joonas Kalda, Clément Pagés, Ricard Marxer, Tanel Alumäe, Hervé Bredin

TL;DR
PixIT jointly trains speaker diarization and speech separation on real multi-speaker recordings. It improves downstream ASR without extensive fine-tuning and reuses existing clustering-based diarization techniques to stitch separated sources together.
Contribution
It introduces PixIT, a joint training framework that combines PIT for speaker diarization with MixIT for speech separation. This addresses overseparation and enables stitching of locally separated sources, at the small added cost of requiring speaker diarization labels.
Findings
PixIT improves ASR performance on meeting data.
It reduces overseparation issues in speech separation.
It benefits various ASR systems without requiring them to be fine-tuned.
Abstract
A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet it struggles with overseparation and with adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (PIT) for speaker diarization (SD) and MixIT for SSep. With the small extra requirement of SD labels, it solves the problem of overseparation and allows stitching local separated sources by leveraging existing work on clustering-based neural SD. We measure the quality of the separated sources by applying automatic speech recognition (ASR) systems to them. PixIT boosts the performance of various ASR systems across two meeting corpora both in terms of the speaker-attributed and utterance-based…
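To make the two training criteria named in the abstract concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of binary cross-entropy for the diarization branch, mean squared error as a stand-in for the separation loss, and the `weight` parameter that blends the two terms are all hypothetical choices made for this example. PIT takes the best loss over all orderings of the predicted speaker tracks; MixIT takes the best loss over all ways of assigning the estimated sources to the two original mixtures.

```python
import itertools
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted and reference frame activities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(pred)


def pit_loss(pred_tracks, ref_tracks):
    """PIT: lowest mean BCE over all orderings of the predicted speaker tracks."""
    n = len(ref_tracks)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        loss = sum(bce(pred_tracks[perm[i]], ref_tracks[i]) for i in range(n)) / n
        best = min(best, loss)
    return best


def mse(a, b):
    """Mean squared error between two equally long signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mixit_loss(sources, mix1, mix2):
    """MixIT: lowest error over all assignments of estimated sources to two mixtures.

    MSE is used here for simplicity (an assumption of this sketch).
    """
    length = len(mix1)
    best = float("inf")
    for assign in itertools.product((0, 1), repeat=len(sources)):
        remixed = [[0.0] * length, [0.0] * length]
        for src, m in zip(sources, assign):
            for t in range(length):
                remixed[m][t] += src[t]
        best = min(best, mse(remixed[0], mix1) + mse(remixed[1], mix2))
    return best


def joint_loss(pred_tracks, ref_tracks, sources, mix1, mix2, weight=0.5):
    """Hypothetical weighted sum of the diarization (PIT) and separation (MixIT) terms."""
    return weight * pit_loss(pred_tracks, ref_tracks) + (1 - weight) * mixit_loss(
        sources, mix1, mix2
    )
```

In this sketch, `pred_tracks` and `ref_tracks` are lists of per-speaker frame activities in [0, 1], while `sources`, `mix1`, and `mix2` are lists of waveform samples, with `sources` estimated from the sum of the two mixtures. Both searches are exhaustive, which is only practical for the small numbers of speakers and sources typical of short local windows.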
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
