Semi-supervised multi-channel speaker diarization with cross-channel attention
Shilong Wu, Jun Du, Maokui He, Shutong Niu, Hang Chen, Haitao Tang, Chin-Hui Lee

TL;DR
This paper introduces a semi-supervised multi-channel speaker diarization system that leverages pseudo-labels and cross-channel attention, significantly reducing diarization errors with limited labeled data.
Contribution
It proposes a novel semi-supervised approach with cross-channel attention in neural speaker diarization, improving performance with partial labels.
Findings
Achieved a 57.01% relative DER reduction over a clustering-based model on the CHiME-7 Mixer6 development set.
Performed well on CHiME-6 with only 50% of the training data labeled, a setup that simulates missing training labels.
Effectively utilized unlabeled data through pseudo-labeling.
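For readers unfamiliar with the metric in the first finding, a relative DER reduction compares the diarization error rate (DER) of a system against a baseline. The sketch below shows the arithmetic only; the function name and the DER values are hypothetical and chosen for illustration, not taken from the paper.

```python
def relative_der_reduction(baseline_der: float, system_der: float) -> float:
    """Return the relative DER reduction in percent.

    Computed as 100 * (baseline - system) / baseline.
    """
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - system_der) / baseline_der


# Hypothetical DER values (in percent), for illustration only.
example = relative_der_reduction(baseline_der=20.0, system_der=10.0)
print(f"relative DER reduction: {example:.2f}%")
```

A relative reduction is measured against the baseline's own error rate, so a figure above 50% means the system's DER is less than half that of the baseline.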
Abstract
Most neural speaker diarization systems rely on sufficient manual training data labels, which are hard to collect under real-world scenarios. This paper proposes a semi-supervised speaker diarization system to utilize large-scale multi-channel training data by generating pseudo-labels for unlabeled data. Furthermore, we introduce cross-channel attention into the Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding (NSD-MA-MSE) to better learn the channel contextual information of speaker embeddings. Experimental results on the CHiME-7 Mixer6 dataset, which only contains partial speakers' labels of the training set, show that our system achieved a 57.01% relative DER reduction compared to the clustering-based model on the development set. We further conducted experiments on the CHiME-6 dataset to simulate the scenario of missing partial training set labels. When using 80% and…
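To make the idea of attention across channels concrete, here is a minimal NumPy sketch of generic scaled dot-product attention applied over the channel axis of per-channel embeddings. It is not the authors' NSD-MA-MSE implementation: the tensor layout (channels, frames, dim), the projection matrices `w_q`, `w_k`, `w_v`, and the function name are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_channel_attention(x, w_q, w_k, w_v):
    """Attend across channels independently at each frame.

    x: array of shape (channels, frames, dim), an assumed layout.
    Returns the attended features (channels, frames, dim) and the
    attention weights (frames, channels, channels).
    """
    xt = np.transpose(x, (1, 0, 2))            # (frames, channels, dim)
    q, k, v = xt @ w_q, xt @ w_k, xt @ w_v     # each (frames, channels, dim)
    scores = q @ np.transpose(k, (0, 2, 1))    # (frames, channels, channels)
    weights = softmax(scores / np.sqrt(q.shape[-1]), axis=-1)
    out = weights @ v                          # (frames, channels, dim)
    return np.transpose(out, (1, 0, 2)), weights


# Random inputs stand in for per-channel speaker embeddings (hypothetical data).
rng = np.random.default_rng(0)
C, T, D = 4, 10, 8
x = rng.standard_normal((C, T, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out, weights = cross_channel_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to one, so every channel's output at a given frame is a weighted mix of all channels at that frame. That is one plausible way for a model to pick up channel contextual information.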
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
