Profile-Error-Tolerant Target-Speaker Voice Activity Detection

Dongmei Wang; Xiong Xiao; Naoyuki Kanda; Midia Yousefi; Takuya Yoshioka; Jian Wu

arXiv:2309.12521·cs.SD·April 5, 2024

Open Access

TL;DR

This paper introduces PET-TSVAD, an extension of TS-VAD that tolerates errors in speaker profiles by combining a transformer-based TS-VAD with additional pseudo-speaker profiles, improving diarization accuracy.

Contribution

The paper proposes PET-TSVAD, a transformer-based TS-VAD extension that handles a variable number of speakers, tolerates speaker profile errors, and adds pseudo-speaker profiles to cover speakers missed by the first-pass diarization.

Findings

01 PET-TSVAD outperforms existing TS-VAD on VoxConverse.

02 The method is robust to speaker profile errors.

03 Training with speaker profiles estimated by multiple different clustering algorithms reduces profile mismatch.

Abstract

Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from errors in speaker profiles, as those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. This paper proposes an extension to TS-VAD, called Profile-Error-Tolerant TS-VAD (PET-TSVAD), which is robust to such speaker profile errors. This is achieved by employing a transformer-based TS-VAD that can handle a variable number of speakers and by introducing a set of additional pseudo-speaker profiles to handle speakers undetected during the first-pass diarization. During training, we use speaker profiles estimated by multiple different clustering algorithms to reduce the mismatch…
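
The pseudo-speaker idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `append_pseudo_profiles` and `toy_activity` are hypothetical, the pseudo-speaker profiles are treated as given vectors (the abstract does not say how they are obtained), and a dot-product-plus-sigmoid scorer stands in for the transformer-based TS-VAD. The sketch only shows that the profile set passed to the detector is the first-pass profiles plus a fixed number of extra slots, so the number of output activity streams varies with the number of detected speakers.

```python
import math
from typing import List

Vector = List[float]


def append_pseudo_profiles(estimated: List[Vector], pseudo: List[Vector]) -> List[Vector]:
    """Return first-pass profiles followed by the pseudo-speaker profiles.

    Hypothetical helper: `pseudo` is assumed to be a fixed set of vectors
    with the same dimension as the estimated profiles.
    """
    if not pseudo:
        raise ValueError("at least one pseudo-speaker profile is required")
    dim = len(pseudo[0])
    for profile in estimated + pseudo:
        if len(profile) != dim:
            raise ValueError("all profiles must share the same dimension")
    return [list(p) for p in estimated] + [list(p) for p in pseudo]


def toy_activity(frames: List[Vector], profiles: List[Vector]) -> List[List[float]]:
    """Stand-in scorer (NOT the paper's model): sigmoid of frame-profile dot product.

    Returns one value in (0, 1) per (frame, profile) pair.
    """
    scores = []
    for frame in frames:
        row = []
        for profile in profiles:
            dot = sum(f * p for f, p in zip(frame, profile))
            row.append(1.0 / (1.0 + math.exp(-dot)))
        scores.append(row)
    return scores


# Two speakers found by first-pass clustering, one extra pseudo-speaker slot.
estimated = [[1.0, 0.0], [0.0, 1.0]]
pseudo = [[0.5, 0.5]]
profiles = append_pseudo_profiles(estimated, pseudo)

# Two toy frame embeddings; the second is all zeros.
frames = [[2.0, 0.0], [0.0, 0.0]]
scores = toy_activity(frames, profiles)
```

Here `scores` has one row per frame and one column per profile, including the pseudo-speaker slot. In this toy setup the extra column is just another score; in the approach described by the abstract, the additional profiles are what allow speakers undetected by the first pass to be handled.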

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing