TSUP Speaker Diarization System for Conversational Short-phrase Speaker Diarization Challenge

Bowen Pang; Huan Zhao; Gaosheng Zhang; Xiaoyue Yang; Yang Sun; Li Zhang; Qing Wang; Lei Xie

arXiv:2210.14653 · cs.SD · October 26, 2023 · 1 cites

TL;DR

This paper presents the TSUP speaker diarization system for the ISCSLP 2022 CSSD challenge, comparing spectral clustering, TS-VAD, and EEND approaches, with spectral clustering performing best under the new CDER metric.

Contribution

The paper introduces a comprehensive evaluation of three diarization methods on short-phrase conversations using a new metric, highlighting the effectiveness of spectral clustering and the impact of hyperparameter tuning.

Findings

1. Spectral clustering outperforms TS-VAD and EEND under CDER.

2. Hyperparameter tuning is essential to CDER for all three system types; in particular, longer sub-segments give a lower CDER.

3. Multi-system fusion with DOVER-LAP degrades performance.
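To make the first finding more concrete, here is a minimal sketch of spectral clustering applied to segment-level speaker embeddings. It is an illustration only and not the TSUP system: the function name `spectral_cluster_two_speakers`, the toy 2-D embeddings, the cosine affinity, the unnormalized Laplacian, and the restriction to two speakers are all assumptions made for this example.

```python
import numpy as np


def spectral_cluster_two_speakers(embeddings):
    """Split segment embeddings into two groups using the Fiedler vector.

    Hypothetical helper for illustration; assumes exactly two speakers.
    """
    x = np.asarray(embeddings, dtype=float)
    # Length-normalize so that the dot product is the cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)

    # Cosine affinity, clipped to be non-negative, with no self-loops.
    affinity = np.clip(x @ x.T, 0.0, None)
    np.fill_diagonal(affinity, 0.0)

    # Unnormalized graph Laplacian L = D - A.
    laplacian = np.diag(affinity.sum(axis=1)) - affinity

    # eigh returns eigenvalues in ascending order; the eigenvector of the
    # second-smallest eigenvalue (Fiedler vector) encodes the two-way split.
    _, eigvecs = np.linalg.eigh(laplacian)
    labels = (eigvecs[:, 1] > 0).astype(int)

    # Eigenvector sign is arbitrary, so pin the first segment to label 0.
    if labels[0] == 1:
        labels = 1 - labels
    return labels


# Toy embeddings: two segments near one direction, two near another.
toy = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.95]]
labels = spectral_cluster_two_speakers(toy)
```

In a real pipeline the embeddings would come from a speaker embedding extractor, and the number of speakers would be estimated or given rather than fixed at two.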

Abstract

This paper describes the TSUP team's submission to the ISCSLP 2022 conversational short-phrase speaker diarization (CSSD) challenge, which focuses on short-phrase conversations and uses a new evaluation metric called conversational diarization error rate (CDER). In this challenge, we explore three typical kinds of speaker diarization systems: spectral clustering (SC) based diarization, target-speaker voice activity detection (TS-VAD), and end-to-end neural diarization (EEND). Our major findings are summarized as follows. First, the SC approach is favored over the other two approaches under the new CDER metric. Second, hyperparameter tuning is essential to CDER for all three types of speaker diarization systems. Specifically, CDER becomes smaller when the sub-segment length is set longer. Finally, multi-system fusion through DOVER-LAP will worsen…
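The sub-segment length mentioned in the abstract can be pictured as the window size used when cutting a detected speech region into pieces before embedding extraction. The sketch below assumes a uniform sliding window; the function name `split_into_subsegments` and the default `window` and `hop` values are hypothetical and are not taken from the paper.

```python
def split_into_subsegments(start, end, window=1.5, hop=0.75):
    """Cut the speech region [start, end) into overlapping sub-segments.

    Times are in seconds. Hypothetical helper; defaults are illustrative.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")

    subsegments = []
    t = start
    while t < end:
        # The final sub-segment is truncated at the region boundary.
        subsegments.append((t, min(t + window, end)))
        if t + window >= end:
            break
        t += hop
    return subsegments


# A 3-second speech region with a shorter and a longer window.
short_windows = split_into_subsegments(0.0, 3.0, window=1.5, hop=0.75)
long_windows = split_into_subsegments(0.0, 3.0, window=3.0, hop=1.5)
```

A longer window produces fewer sub-segments per region, so each clustering decision covers more speech. This is one plausible reading of why the setting interacts with CDER, not a claim made by the authors.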

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing