Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching

Zhengyang Chen; Bing Han; Shuai Wang; Yidi Jiang; Yanmin Qian

arXiv:2409.04859 · cs.SD · September 20, 2024

Open Access

TL;DR

This paper introduces Flow-TSVAD, a generative approach to speaker diarization that maps binary label sequences into a dense latent space before applying flow matching. It converges in two inference steps, supports sampling and ensembling of diarization outputs, and outperforms the discriminative Seq2Seq-TSVAD baseline.

Contribution

The paper applies neural network-based generative methods to speaker diarization for the first time, proposing a latent flow matching technique within a sequence-to-sequence framework.

Findings

1. Flow-TSVAD outperforms Seq2Seq-TSVAD in diarization accuracy.
2. The model converges in only two inference steps.
3. Sampling and ensembling improve diarization results.

Abstract

Speaker diarization is typically considered a discriminative task, using discriminative approaches to produce fixed diarization results. In this paper, we explore the use of neural network-based generative methods for speaker diarization for the first time. We implement a Flow-Matching (FM) based generative algorithm within the sequence-to-sequence target speaker voice activity detection (Seq2Seq-TSVAD) diarization system. Our experiments reveal that applying the generative method directly to the original binary label sequence space of the TS-VAD output is ineffective. To address this issue, we propose mapping the binary label sequence into a dense latent space before applying the generative algorithm. The resulting Flow-TSVAD method outperforms the Seq2Seq-TSVAD system. Additionally, we observe that the FM algorithm converges rapidly during the inference stage, requiring only two…
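
To make the inference procedure in the abstract more concrete, the following is a minimal sketch of flow-matching sampling with two Euler steps, followed by ensembling over several noise samples. It is a toy illustration and not the paper's implementation: `toy_velocity` is a hypothetical stand-in for a trained velocity network, the "latent" is simply the labels rescaled to the range -1 to 1 rather than a learned dense latent space, and `decode_labels` is an assumed threshold decoder.

```python
import numpy as np


def toy_velocity(x, t, target):
    # Hypothetical stand-in for a trained velocity network (assumption).
    # For a linear path x_t = (1 - t) * x0 + t * x1 with a single known
    # target x1, the exact velocity is (x1 - x_t) / (1 - t).
    return (target - x) / (1.0 - t)


def euler_sample(velocity, x0, num_steps, **kwargs):
    # Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps.
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays below 1, so (1 - t) is never zero
        x = x + dt * velocity(x, t, **kwargs)
    return x


def decode_labels(latent):
    # Assumed toy decoder: threshold the latent at zero to get binary labels.
    return (latent > 0.0).astype(int)


# Toy binary activity labels: 2 speakers x 6 frames (made-up data).
labels = np.array([[0, 0, 1, 1, 1, 0],
                   [1, 1, 0, 0, 1, 1]])
# Toy "latent": labels rescaled to -1 / +1 (assumption, not a learned space).
target_latent = 2.0 * labels - 1.0

rng = np.random.default_rng(0)
# Draw several noise initialisations and run two Euler steps for each.
samples = [
    euler_sample(toy_velocity, rng.standard_normal(target_latent.shape),
                 num_steps=2, target=target_latent)
    for _ in range(4)
]

# Ensemble by averaging the sampled latents, then decode to binary labels.
ensemble = np.mean(samples, axis=0)
decoded = decode_labels(ensemble)
print(decoded)
```

In this toy setting the velocity field is exact, so every sample reaches the target regardless of step count. With a trained network, different noise draws would give different outputs, which is what makes averaging them a meaningful ensembling step.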

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and dialogue systems