NTT speaker diarization system for CHiME-7: multi-domain, multi-microphone End-to-end and vector clustering diarization

Naohiro Tawara; Marc Delcroix; Atsushi Ando; Atsunori Ogawa

arXiv:2309.12656·eess.AS·September 25, 2023

Open Access

TL;DR

This paper presents a multi-microphone, multi-domain speaker diarization system that combines WPE-based dereverberation, per-channel EEND-VC diarization, DOVER-LAP integration across channels, and per-session self-supervised adaptation, achieving 65% and 62% relative improvements on the CHiME-7 development and evaluation sets.

Contribution

The paper introduces a multi-channel diarization pipeline with per-session self-supervised adaptation, which improves on the baseline system for multi-domain, multi-microphone casual conversations.

Findings

01

Achieved 65% and 62% relative improvements over the baseline on the development and evaluation sets, respectively.

02

Secured third place in CHiME-7 diarization performance.

03

Demonstrated effectiveness of self-supervised adaptation in multi-microphone diarization.

Abstract

This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It integrates the diarization result obtained from each channel using diarization output voting error reduction plus overlap (DOVER-LAP). To harness the knowledge from the target domain and results integrated across all channels, we apply self-supervised adaptation for each session by retraining the EEND-VC with pseudo-labels derived from DOVER-LAP. The proposed system was incorporated into NTT's submission for the distant automatic speech recognition task in the CHiME-7 challenge. Our system achieved 65% and 62% relative improvements on development and eval…
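
To make the channel-integration step concrete, the sketch below fuses per-channel speaker-activity decisions with a plain per-frame majority vote. It is a deliberately simplified stand-in, not DOVER-LAP itself: it assumes speaker labels are already aligned across channels and that every channel carries equal weight, whereas DOVER-LAP also performs label mapping and weighted voting. The function name `majority_vote` and the toy data are hypothetical and do not come from the paper.

```python
def majority_vote(channel_activities):
    """Fuse binary speaker-activity decisions from several channels.

    channel_activities[c][t][s] is 1 if channel c marks speaker s as
    active in frame t, else 0. Assumption (not from the paper): speaker
    indices already refer to the same person in every channel.
    """
    n_channels = len(channel_activities)
    n_frames = len(channel_activities[0])
    n_speakers = len(channel_activities[0][0])

    fused = []
    for t in range(n_frames):
        frame = []
        for s in range(n_speakers):
            votes = sum(channel_activities[c][t][s] for c in range(n_channels))
            # Speaker is active only if a strict majority of channels agree.
            frame.append(1 if 2 * votes > n_channels else 0)
        fused.append(frame)
    return fused


# Toy example: 3 channels, 4 frames, 2 speakers (made-up data).
ch0 = [[1, 0], [1, 1], [0, 1], [0, 0]]
ch1 = [[1, 0], [1, 0], [0, 1], [0, 1]]
ch2 = [[0, 0], [1, 1], [1, 1], [0, 0]]

fused = majority_vote([ch0, ch1, ch2])
print(fused)  # [[1, 0], [1, 1], [0, 1], [0, 0]]
```

In the pipeline the abstract describes, the integrated output would then serve as pseudo-labels for per-session retraining of EEND-VC; that step is not sketched here.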

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing