GIST-AiTeR System for the Diarization Task of the 2022 VoxCeleb Speaker Recognition Challenge

Dongkeon Park; Yechan Yu; Kyeong Wan Park; Ji Won Kim; Hong Kook Kim

arXiv:2209.10357·eess.AS·October 7, 2022

TL;DR

The GIST-AiTeR system for the 2022 VoxCeleb Speaker Recognition Challenge ensembles four diarization systems built from multiple speech processing models, achieving a diarization error rate of 5.12% on the evaluation set and ranking third in the diarization track.

Contribution

This report introduces an ensemble speaker diarization system that integrates speech enhancement, voice activity detection (VAD), multi-scaled speaker embeddings, PLDA-based speaker clustering, and overlapped speech detection.

Findings

01

Achieved a diarization error rate of 5.12% on the challenge evaluation set.

02

Constructed four different diarization systems with various model combinations.

03

Ranked third in the diarization track (Track 4) of the 2022 VoxCeleb Speaker Recognition Challenge.

Abstract

This report describes the submission system of the GIST-AiTeR team at the 2022 VoxCeleb Speaker Recognition Challenge (VoxSRC) Track 4. Our system mainly includes speech enhancement, voice activity detection, multi-scaled speaker embedding, probabilistic linear discriminant analysis-based speaker clustering, and overlapped speech detection models. We first construct four different diarization systems from different model combinations, each chosen through our best experimental efforts. Our final submission is an ensemble of all four systems and achieves a diarization error rate of 5.12% on the challenge evaluation set, ranking third in the diarization track of the challenge.
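For readers unfamiliar with the metric, the diarization error rate quoted above is conventionally the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. The sketch below only illustrates that standard definition; it is not the challenge's scoring tool, the function name `diarization_error_rate` is hypothetical, and the durations are made-up values chosen for illustration, not figures from the report.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Return DER as a fraction of total reference speech time.

    All arguments are durations in seconds:
      false_alarm  - non-speech labelled as speech
      missed       - reference speech that was not detected
      confusion    - speech attributed to the wrong speaker
      total_speech - total reference speech time
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


# Hypothetical durations (seconds), for illustration only.
der = diarization_error_rate(
    false_alarm=10.0, missed=20.0, confusion=20.0, total_speech=1000.0
)
print(f"DER = {der:.2%}")  # prints "DER = 5.00%"
```

Because all three error types share one denominator, improvements to voice activity detection (fewer false alarms and misses) and to clustering or overlap handling (less confusion) all lower the same headline number.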

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing