Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization
Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Takuya Yoshioka, Jian Wu

TL;DR
This paper introduces a transformer-based target speaker voice activity detection (TS-VAD) model that handles a variable number of speakers and improves diarization accuracy, achieving state-of-the-art results on the VoxConverse and CALLHOME datasets.
Contribution
The paper proposes a novel transformer architecture for TS-VAD that handles arbitrary speaker counts and integrates it with end-to-end neural diarization, setting new state-of-the-art performance.
Findings
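Using transformers for cross-speaker modeling reduces the diarization error rate (DER) of TS-VAD by 11.3% on VoxConverse.
The extended EEND-EDA, which uses the transformer-based TS-VAD as its speaker detection layer, reduces DER by 6.9% on CALLHOME relative to the original EEND-EDA.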
Achieves new state-of-the-art diarization error rates on both datasets.
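For readers unfamiliar with the metric, the sketch below shows how a diarization error rate is conventionally computed and how a percentage reduction can be read as a relative change. It assumes the reported reductions are relative, which is the usual convention in diarization papers; the function names and all numbers are hypothetical illustrations, not values from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Conventional DER: summed error durations over total reference speech time.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def relative_reduction(baseline_der, improved_der):
    """Fraction by which improved_der is lower than baseline_der."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - improved_der) / baseline_der


# Hypothetical example: these durations and DER values are made up.
baseline = diarization_error_rate(missed=2.0, false_alarm=1.0,
                                  confusion=1.0, total_speech=40.0)
improved = 0.0887  # hypothetical DER of an improved system
print(f"baseline DER = {baseline:.2%}")
print(f"relative reduction = {relative_reduction(baseline, improved):.1%}")
```

Under this reading, a system that moves from a 10% DER to an 8.87% DER has reduced DER by roughly 11.3% relative, even though the absolute change is only about 1.1 percentage points.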
Abstract
This paper describes a speaker diarization model based on target speaker voice activity detection (TS-VAD) using transformers. To overcome the original TS-VAD model's drawback of being unable to handle an arbitrary number of speakers, we investigate model architectures that use input tensors with variable-length time and speaker dimensions. Transformer layers are applied to the speaker axis to make the model output insensitive to the order of the speaker profiles provided to the TS-VAD model. Time-wise sequential layers are interspersed between these speaker-wise transformer layers to allow the temporal and cross-speaker correlations of the input speech signal to be captured. We also extend a diarization model based on end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA) by replacing its dot-product-based speaker detection layer with the transformer-based…
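The interleaving of speaker-wise and time-wise layers described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head self-attention without positional encoding on the speaker axis, uses plain self-attention as a stand-in for the time-wise sequential layers, and every name here (`speaker_time_block`, `ts_vad_sketch`, `init_params`) is hypothetical. It shows two properties the abstract claims: the number of speakers can vary, and reordering the speaker profiles only reorders the outputs.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each feature vector (last axis) to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Single-head self-attention over the second-to-last axis of x: (..., N, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def init_params(dim, rng):
    """Random projection matrices for one block (hypothetical initialization)."""
    def qkv():
        return tuple(rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    return {"spk": qkv(), "time": qkv()}


def speaker_time_block(x, params):
    """x has shape (T, S, D): frames x speakers x features."""
    # Speaker-wise layer: for each frame, attend across the S speakers.
    # No positional encoding is added, so speaker order carries no information.
    x = layer_norm(x + self_attention(x, *params["spk"]))
    # Time-wise layer (stand-in): for each speaker, attend across the T frames.
    xt = np.swapaxes(x, 0, 1)  # (S, T, D)
    xt = layer_norm(xt + self_attention(xt, *params["time"]))
    return np.swapaxes(xt, 0, 1)  # back to (T, S, D)


def ts_vad_sketch(frames, profiles, blocks, w_out):
    """frames: (T, Da), profiles: (S, Dp) -> per-frame, per-speaker activity in (0, 1)."""
    n_frames, n_speakers = frames.shape[0], profiles.shape[0]
    # Pair every frame with every speaker profile: (T, S, Da + Dp).
    x = np.concatenate([
        np.broadcast_to(frames[:, None, :], (n_frames, n_speakers, frames.shape[1])),
        np.broadcast_to(profiles[None, :, :], (n_frames, n_speakers, profiles.shape[1])),
    ], axis=-1)
    for params in blocks:
        x = speaker_time_block(x, params)
    logits = x @ w_out  # (T, S)
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
frames = rng.standard_normal((6, 4))    # 6 frames of 4-dim acoustic features
profiles = rng.standard_normal((3, 4))  # 3 speaker profiles of 4 dims
blocks = [init_params(8, rng) for _ in range(2)]
w_out = rng.standard_normal(8)

probs = ts_vad_sketch(frames, profiles, blocks, w_out)
perm = [2, 0, 1]
probs_perm = ts_vad_sketch(frames, profiles[perm], blocks, w_out)
print(probs.shape, np.allclose(probs[:, perm], probs_perm))
```

Because no layer depends on a fixed speaker count, the same weights accept any number of profiles. A real model would use multi-head attention, feed-forward sublayers, and a trained time-wise layer; the sketch only demonstrates the tensor layout and the order-insensitivity argument.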
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Softmax · Layer Normalization · Dropout · Dense Connections · Adam · Position-Wise Feed-Forward Layer
