Semi-supervised multi-channel speaker diarization with cross-channel attention

Shilong Wu; Jun Du; Maokui He; Shutong Niu; Hang Chen; Haitao Tang; Chin-Hui Lee

arXiv:2307.08688 · eess.AS · July 18, 2023 · ASRU · 1 citation

Open Access

TL;DR

This paper introduces a semi-supervised multi-channel speaker diarization system that generates pseudo-labels for unlabeled data and adds cross-channel attention, achieving a 57.01% relative DER reduction over a clustering-based model on the CHiME-7 Mixer6 development set.

Contribution

It proposes a novel semi-supervised approach with cross-channel attention in neural speaker diarization, improving performance with partial labels.
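
As a rough illustration of what attention across channels can look like, the sketch below applies generic scaled dot-product self-attention over the channel axis, independently for each frame. It is not the NSD-MA-MSE architecture from the paper: the function name `cross_channel_attention`, the tensor layout, the single-head design, and the random projection matrices are all assumptions made for this example.

```python
import numpy as np

def cross_channel_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the channel axis.

    x: (channels, frames, dim) per-channel frame embeddings (assumed layout).
    w_q, w_k, w_v: (dim, dim) projection matrices (hypothetical parameters).
    Returns an array with the same shape as x.
    """
    # Put frames first so each frame attends across its own set of channels.
    xt = np.transpose(x, (1, 0, 2))                  # (frames, channels, dim)
    q, k, v = xt @ w_q, xt @ w_k, xt @ w_v           # each (frames, channels, dim)
    scores = q @ np.transpose(k, (0, 2, 1))          # (frames, channels, channels)
    scores = scores / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over channels
    out = weights @ v                                # (frames, channels, dim)
    return np.transpose(out, (1, 0, 2))              # back to (channels, frames, dim)

# Toy input: 4 channels, 10 frames, 8-dimensional embeddings (illustrative sizes).
rng = np.random.default_rng(0)
channels, frames, dim = 4, 10, 8
x = rng.standard_normal((channels, frames, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
y = cross_channel_attention(x, w_q, w_k, w_v)
print(y.shape)
```

Each output embedding is a weighted mix of the embeddings from all channels at the same frame, which is one simple way for a model to share context between microphones.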

Findings

01

Achieved a 57.01% relative DER reduction over a clustering-based model on the CHiME-7 Mixer6 development set.

02

Performed well with only 50% labeled data on CHiME-6.

03

Effectively utilized unlabeled data through pseudo-labeling.
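
The relative DER reduction quoted in the findings is conventionally computed as the drop in DER divided by the baseline DER. A minimal sketch of that arithmetic follows; the function name and the DER values are hypothetical and are not taken from the paper.

```python
def relative_der_reduction(baseline_der: float, system_der: float) -> float:
    """Return the relative DER reduction in percent.

    Computed as 100 * (baseline - system) / baseline.
    """
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - system_der) / baseline_der

# Hypothetical DER values (in percent), chosen only to show the arithmetic;
# they are NOT results reported in the paper.
example = relative_der_reduction(baseline_der=20.0, system_der=10.0)
print(f"{example:.2f}% relative reduction")
```

A relative figure depends on the baseline: the same absolute improvement yields a larger relative reduction when the baseline DER is lower.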

Abstract

Most neural speaker diarization systems rely on sufficient manually labeled training data, which is hard to collect in real-world scenarios. This paper proposes a semi-supervised speaker diarization system that utilizes large-scale multi-channel training data by generating pseudo-labels for unlabeled data. Furthermore, we introduce cross-channel attention into the Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding (NSD-MA-MSE) to better learn channel contextual information of speaker embeddings. Experimental results on the CHiME-7 Mixer6 dataset, whose training set contains labels for only some of the speakers, show that our system achieved a 57.01% relative DER reduction compared to the clustering-based model on the development set. We further conducted experiments on the CHiME-6 dataset to simulate the scenario of missing partial training set labels. When using 80% and…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques