Cross-Channel Attention-Based Target Speaker Voice Activity Detection: Experimental Results for M2MeT Challenge

Weiqing Wang; Xiaoyi Qin; Ming Li

arXiv:2202.02687·eess.AS·February 8, 2022

Open Access

TL;DR

This paper introduces a cross-channel attention-based target-speaker voice activity detection (TS-VAD) system for speaker diarization in multi-channel meeting recordings. It cuts the diarization error rate (DER) on the evaluation set from 12.68% to 2.26% and placed 1st in the M2MeT challenge.

Contribution

The paper proposes a cross-channel self-attention mechanism for target-speaker VAD that learns and fuses the non-linear spatial correlations between channels, lowering the evaluation-set DER from 3.14% with single-channel TS-VAD to 2.26%.
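The idea of attending across channels can be illustrated with a minimal, standard-library sketch. It is not the authors' model: the identity query/key/value projections, the per-frame processing, and the averaging step in `fuse_channels` are all assumptions made for illustration, and a trained system would learn its projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_channel_attention(frame):
    """Scaled dot-product self-attention across channels for one time frame.

    frame: list of C channel embeddings, each a list of D floats.
    Identity projections are used (an assumption); a real model learns them.
    """
    dim = len(frame[0])
    scale = math.sqrt(dim)
    attended = []
    for query in frame:
        # Similarity of this channel to every channel, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frame]
        weights = softmax(scores)
        # Each output is a weighted sum of all channel embeddings.
        attended.append([
            sum(w * value[d] for w, value in zip(weights, frame))
            for d in range(dim)
        ])
    return attended


def fuse_channels(attended):
    """Average attended channel embeddings into one vector (hypothetical fusion)."""
    n = len(attended)
    return [sum(ch[d] for ch in attended) / n for d in range(len(attended[0]))]


# Toy input: 8 channels, 4-dimensional embeddings for a single frame.
frame = [[0.1 * (c + 1), 0.0, 1.0, -0.5] for c in range(8)]
attended = cross_channel_attention(frame)
fused = fuse_channels(attended)
```

Because the attention weights for each channel sum to one, any feature that is identical across all channels (here the last three dimensions) passes through unchanged, while features that differ between channels are mixed according to their similarity.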

Findings

01 Single-channel TS-VAD reduces DER by over 75%, from 12.68% to 3.14% on the evaluation set.

02 Multi-channel TS-VAD further reduces DER by 28%, to 2.26%.

03 The final submitted system placed 1st in the M2MeT challenge with a DER of 2.98%.
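The relative reductions quoted here can be checked against the evaluation-set DER values reported in the abstract (12.68%, 3.14%, and 2.26%). The helper name `relative_reduction` is introduced only for this check.

```python
def relative_reduction(before, after):
    """Relative reduction, in percent, when a metric drops from `before` to `after`."""
    return 100.0 * (before - after) / before


# Evaluation-set DER values (in percent) as stated in the abstract.
starting_der = 12.68
single_channel_der = 3.14
multi_channel_der = 2.26

single_gain = relative_reduction(starting_der, single_channel_der)
multi_gain = relative_reduction(single_channel_der, multi_channel_der)

print(f"{single_gain:.1f}% {multi_gain:.1f}%")  # prints "75.2% 28.0%"
```

Note that the 28% figure is relative to the single-channel result (3.14%), not to the starting DER of 12.68%.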

Abstract

In this paper, we present the speaker diarization system for the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) from team DKU_DukeECE. Because the dataset contains highly overlapped speech, we employ x-vector-based target-speaker voice activity detection (TS-VAD) to detect the overlap between speakers. For the single-channel scenario, we train a separate model for each of the 8 channels and fuse the results. We also employ cross-channel self-attention to further improve performance, learning and fusing the non-linear spatial correlations between channels. Experimental results on the evaluation set show that the single-channel TS-VAD reduces the DER by over 75%, from 12.68% to 3.14%. The multi-channel TS-VAD further reduces the DER by 28%, to 2.26%. Our final submitted system achieves a DER of 2.98% on the AliMeeting test…
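The abstract does not say how the per-channel results are fused. One simple possibility, shown here purely as an assumption, is to average each speaker's per-frame activity probability over the channels and then threshold. The function names, the 0.5 threshold, and the toy numbers are all hypothetical.

```python
def fuse_posteriors(per_channel):
    """Average speaker-activity probabilities over channels.

    per_channel[c][s][t] is the probability, from channel c's model,
    that speaker s is active in frame t.
    """
    n_channels = len(per_channel)
    n_speakers = len(per_channel[0])
    n_frames = len(per_channel[0][0])
    return [
        [sum(per_channel[c][s][t] for c in range(n_channels)) / n_channels
         for t in range(n_frames)]
        for s in range(n_speakers)
    ]


def decide(fused, threshold=0.5):
    """Turn fused probabilities into per-speaker, per-frame activity decisions."""
    return [[p > threshold for p in speaker] for speaker in fused]


def overlap_frames(decisions):
    """Indices of frames in which more than one speaker is active."""
    n_frames = len(decisions[0])
    return [t for t in range(n_frames) if sum(spk[t] for spk in decisions) > 1]


# Toy example: 2 channels, 2 speakers, 3 frames (made-up probabilities).
per_channel = [
    [[0.9, 0.8, 0.2], [0.1, 0.7, 0.9]],  # channel 0: speaker 0, speaker 1
    [[0.7, 0.6, 0.4], [0.3, 0.5, 0.7]],  # channel 1: speaker 0, speaker 1
]
fused = fuse_posteriors(per_channel)
decisions = decide(fused)
overlaps = overlap_frames(decisions)  # frame 1 has both speakers active
```

Because TS-VAD produces an independent activity decision per speaker, overlapped speech falls out directly as the frames where several speakers are marked active at once.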

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing