Multi-target Extractor and Detector for Unknown-number Speaker Diarization

Chin-Yi Cheng; Hung-Shin Lee; Yu Tsao; Hsin-Min Wang

arXiv:2203.16007·cs.SD·June 7, 2023

TL;DR

This paper introduces a neural network architecture that extracts speaker representations and detects their presence in multi-speaker conversations, improving diarization accuracy across varying speaker counts.

Contribution

The proposed model combines speaker representation extraction and frame-level speaker presence detection in a single framework, so it does not need to know the number of speakers in a conversation in advance.

Findings

01  Outperforms most previously proposed methods on the CALLHOME corpus

02  Achieves a 6.4% to 30.9% relative reduction in diarization error rate

03  Remains effective in a more challenging setting with 2 to 7 simultaneous speakers
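The relative reductions quoted above compare a system's diarization error rate (DER) against a baseline's. As a minimal sketch of that arithmetic, the function below computes the relative reduction in percent. The function name and the input values are hypothetical illustrations and are not taken from the paper.

```python
def relative_der_reduction(baseline_der: float, system_der: float) -> float:
    """Return the relative DER reduction of a system over a baseline, in percent.

    Both arguments are DER values on the same scale (for example, percent).
    """
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - system_der) / baseline_der * 100.0


# Hypothetical values for illustration only (not results from the paper):
# a baseline DER of 20.0 and a system DER of 15.0.
reduction = relative_der_reduction(20.0, 15.0)
print(f"{reduction:.1f}% relative reduction")
```

A relative reduction is always stated with respect to the baseline, so the same absolute gain yields a larger relative figure when the baseline DER is low.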

Abstract

Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker on a frame-by-frame basis regardless of the number of speakers in a conversation. A speaker representation (called z-vector) extractor and a time-speaker contextualizer, implemented by a residual network and processing data in both temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most of the methods proposed so far. Evaluations in a more challenging case with simultaneous speakers ranging from 2 to 7 show that our model achieves 6.4% to 30.9%…
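To make "detects the presence of each speaker on a frame-by-frame basis" concrete, here is a minimal sketch of how per-frame, per-speaker presence scores could be turned into speaker segments. It is not the authors' implementation: the function name, the 0.5 threshold, and the example scores are assumptions chosen for illustration. Each speaker is thresholded independently, so overlapping speech is allowed.

```python
from typing import Dict, List, Sequence, Tuple


def frames_to_segments(
    probs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> Dict[int, List[Tuple[int, int]]]:
    """Convert a (frames x speakers) score matrix into per-speaker segments.

    Each segment is a (start_frame, end_frame) pair with an exclusive end.
    """
    segments: Dict[int, List[Tuple[int, int]]] = {}
    if not probs:
        return segments
    num_speakers = len(probs[0])
    for spk in range(num_speakers):
        spk_segments: List[Tuple[int, int]] = []
        start = None  # frame index where the current active run began
        for t, frame in enumerate(probs):
            active = frame[spk] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                spk_segments.append((start, t))
                start = None
        if start is not None:
            # Close a run that lasts until the final frame.
            spk_segments.append((start, len(probs)))
        segments[spk] = spk_segments
    return segments


# Hypothetical scores for 4 frames and 2 speakers; frame 1 is overlapped speech.
example = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.6]]
print(frames_to_segments(example))
```

Because the number of columns is read from the input, the same routine works for any speaker count, which mirrors the unknown-number setting the abstract describes.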

Code & Models

Repositories

chinyi0523/mtead (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing