Profile-Error-Tolerant Target-Speaker Voice Activity Detection
Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Midia Yousefi, Takuya Yoshioka, Jian Wu

TL;DR
This paper introduces PET-TSVAD, a robust extension of TS-VAD that effectively handles errors in speaker profiles using transformer models and pseudo-speaker profiles, improving diarization accuracy.
Contribution
The paper proposes PET-TSVAD, a transformer-based TS-VAD extension that is tolerant to speaker profile errors and incorporates pseudo-speaker profiles for better diarization.
Findings
PET-TSVAD outperforms existing TS-VAD on VoxConverse.
The method is robust to speaker profile errors.
Training with diverse clustering algorithms reduces profile mismatch.
Abstract
Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from errors in speaker profiles, as those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. This paper proposes an extension to TS-VAD, called Profile-Error-Tolerant TS-VAD (PET-TSVAD), which is robust to such speaker profile errors. This is achieved by employing transformer-based TS-VAD that can handle a variable number of speakers and further introducing a set of additional pseudo-speaker profiles to handle speakers undetected during the first pass diarization. During training, we use speaker profiles estimated by multiple different clustering algorithms to reduce the mismatch…
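The profile handling described in the abstract can be pictured with a minimal sketch. It only shows how a variable-length list of first-pass speaker profiles might be extended with a fixed set of pseudo-speaker profiles before being passed to a TS-VAD model. The function name `build_profile_set`, the list-of-floats representation, and the `is_pseudo` mask are illustrative assumptions and are not taken from the paper. How the pseudo-speaker profiles are obtained and how the transformer consumes them is not covered here.

```python
from typing import List, Tuple

Profile = List[float]  # one speaker embedding (hypothetical representation)


def build_profile_set(
    estimated: List[Profile], pseudo: List[Profile]
) -> Tuple[List[Profile], List[bool]]:
    """Append pseudo-speaker profiles to the first-pass profiles.

    Returns the combined profile list and a mask that marks which
    entries are pseudo-speaker profiles. Illustrative helper only.
    """
    # All embeddings must share one dimensionality.
    dims = {len(p) for p in estimated + pseudo}
    if len(dims) > 1:
        raise ValueError("all profiles must have the same dimension")

    # The number of first-pass profiles may vary per recording;
    # the pseudo profiles are appended after them.
    profiles = [list(p) for p in estimated] + [list(p) for p in pseudo]
    is_pseudo = [False] * len(estimated) + [True] * len(pseudo)
    return profiles, is_pseudo


# Example: two speakers found by first-pass clustering, one pseudo slot
# reserved for a speaker the first pass may have missed.
first_pass = [[0.1, 0.9], [0.8, 0.2]]
pseudo_slots = [[0.0, 0.0]]
profiles, is_pseudo = build_profile_set(first_pass, pseudo_slots)
```

In this sketch the mask lets downstream code tell which output tracks correspond to first-pass speakers and which to the extra slots. Whether the actual system needs such a mask is an assumption.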
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
