Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization

Dongmei Wang; Xiong Xiao; Naoyuki Kanda; Takuya Yoshioka; Jian Wu

arXiv:2208.13085 · eess.AS · September 27, 2022 · 1 citation

TL;DR

This paper introduces a transformer-based target speaker voice activity detection model that handles variable numbers of speakers and improves diarization accuracy, achieving state-of-the-art results on VoxConverse and CALLHOME datasets.

Contribution

The paper proposes a novel transformer architecture for TS-VAD that handles arbitrary speaker counts and integrates it with end-to-end neural diarization, setting new state-of-the-art performance.

Findings

01. Using transformers in TS-VAD reduces the diarization error rate (DER) by 11.3% on VoxConverse.

02. The extended EEND-EDA, which uses the transformer-based TS-VAD as its speaker detection layer, reduces DER by 6.9% on CALLHOME.

03. The proposed models achieve new state-of-the-art DERs on both datasets.
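The findings above quote DER reductions as percentages. A minimal sketch of how such a figure is commonly computed follows, assuming the quoted percentages are relative reductions (this summary does not say so explicitly) and using the standard DER definition of missed speech, false alarm, and speaker confusion time over total reference speech time. The function names and all numbers are hypothetical illustrations, not results from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Standard DER: total error time divided by total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def relative_reduction(baseline, improved):
    """Relative improvement of `improved` over `baseline` (both are DERs)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical durations in seconds; these are NOT figures from the paper.
baseline_der = diarization_error_rate(3.0, 2.0, 5.0, 100.0)
improved_der = diarization_error_rate(2.5, 2.0, 4.5, 100.0)
reduction = relative_reduction(baseline_der, improved_der)
print(f"baseline={baseline_der:.3f} improved={improved_der:.3f} "
      f"relative reduction={reduction:.1%}")
```

Under this reading, a small absolute change in DER can correspond to a much larger relative percentage, which is worth keeping in mind when comparing reported gains.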

Abstract

This paper describes a speaker diarization model based on target speaker voice activity detection (TS-VAD) using transformers. To overcome the original TS-VAD model's drawback of being unable to handle an arbitrary number of speakers, we investigate model architectures that use input tensors with variable-length time and speaker dimensions. Transformer layers are applied to the speaker axis to make the model output insensitive to the order of the speaker profiles provided to the TS-VAD model. Time-wise sequential layers are interspersed between these speaker-wise transformer layers to allow the temporal and cross-speaker correlations of the input speech signal to be captured. We also extend a diarization model based on end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA) by replacing its dot-product-based speaker detection layer with the transformer-based…
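The architectural idea in the abstract (attention over the speaker axis, interleaved with time-wise layers) can be illustrated with a toy NumPy sketch. It is not the authors' model: the single-head attention, the moving-average stand-in for the time-wise sequential layer, the random weights, and every function name are assumptions made for illustration. The input `x` is assumed to already combine frame features with each speaker profile into a `(frames, speakers, features)` tensor. The point is only that attention along the speaker axis without positional encoding accepts any number of speakers and yields outputs that follow the order of the speaker profiles.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def speaker_axis_attention(x, wq, wk, wv):
    """Single-head self-attention across the speaker axis, with a residual.

    x: (T, S, D) array. No positional encoding is used on the speaker axis,
    so the layer has no notion of speaker order.
    """
    q, k, v = x @ wq, x @ wk, x @ wv                           # each (T, S, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])   # (T, S, S)
    return x + softmax(scores, axis=-1) @ v


def time_axis_smoothing(x, kernel=3):
    """Hypothetical stand-in for a time-wise sequential layer:
    a moving average over frames, applied to each speaker independently."""
    pad = kernel // 2
    padded = np.pad(x, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    return np.stack([padded[t:t + kernel].mean(axis=0)
                     for t in range(x.shape[0])])


def toy_ts_vad(x, params, w_out):
    """Interleave speaker-wise attention with time-wise layers, then map each
    (frame, speaker) feature vector to a speech-activity probability."""
    for wq, wk, wv in params:
        x = speaker_axis_attention(x, wq, wk, wv)
        x = time_axis_smoothing(x)
    logits = x @ w_out                                         # (T, S)
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
D = 8
params = [tuple(rng.normal(scale=0.1, size=(D, D)) for _ in range(3))
          for _ in range(2)]
w_out = rng.normal(size=D)

for num_speakers in (2, 4):              # the same weights serve any speaker count
    x = rng.normal(size=(10, num_speakers, D))
    probs = toy_ts_vad(x, params, w_out)
    perm = rng.permutation(num_speakers)
    # Reordering the speaker profiles reorders the outputs the same way.
    assert np.allclose(probs[:, perm], toy_ts_vad(x[:, perm], params, w_out))
    print(num_speakers, probs.shape)
```

Because none of the weight shapes depend on the number of speakers, the same parameters can be applied to recordings with different speaker counts, which is the limitation of the original TS-VAD that the abstract says the paper addresses.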

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Softmax · Layer Normalization · Dropout · Dense Connections · Adam · Position-Wise Feed-Forward Layer