3D-Speaker-Toolkit: An Open-Source Toolkit for Multimodal Speaker Verification and Diarization

Yafeng Chen; Siqi Zheng; Hui Wang; Luyao Cheng; Tinglong Zhu; Rongjie Huang; Chong Deng; Qian Chen; Shiliang Zhang; Wen Wang; Xihao Li

arXiv:2403.19971 · eess.AS · December 30, 2024 · ICASSP · 1 citation

TL;DR

3D-Speaker-Toolkit is an open-source multimodal framework combining acoustic, semantic, and visual data to enhance speaker verification and diarization accuracy in various environments.

Contribution

It introduces a comprehensive multimodal toolkit with integrated modules and a large dataset, setting a new benchmark for speaker analysis.

Findings

01. Improved accuracy in speaker verification and diarization.
02. Effective fusion of acoustic, semantic, and visual modalities.
03. Open-source toolkit with state-of-the-art models and a large dataset.

Abstract

We introduce 3D-Speaker-Toolkit, an open-source toolkit for multimodal speaker verification and diarization, designed for meeting the needs of academic researchers and industrial practitioners. The 3D-Speaker-Toolkit adeptly leverages the combined strengths of acoustic, semantic, and visual data, seamlessly fusing these modalities to offer robust speaker recognition capabilities. The acoustic module extracts speaker embeddings from acoustic features, employing both fully-supervised and self-supervised learning approaches. The semantic module leverages advanced language models to comprehend the substance and context of spoken language, thereby augmenting the system's proficiency in distinguishing speakers through linguistic patterns. The visual module applies image processing technologies to scrutinize facial features, which bolsters the precision of speaker diarization in multi-speaker…
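As a rough illustration of how speaker embeddings such as those produced by the acoustic module are commonly used for verification, the sketch below scores two embeddings by cosine similarity and compares the score to a threshold. This is a generic technique, not the toolkit's own API: the function names `cosine_score` and `same_speaker`, the example vectors, and the threshold value of 0.5 are all assumptions made for illustration.

```python
import math


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (plain lists of floats)."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial if the score reaches the threshold.

    The threshold is a hypothetical placeholder; in practice it would be
    tuned on a development set.
    """
    return cosine_score(emb_a, emb_b) >= threshold


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real embeddings.
    enrolled = [0.2, 0.9, 0.1]
    test_utt = [0.25, 0.85, 0.05]
    print(same_speaker(enrolled, test_utt))
```

In a real system the embeddings would come from a trained speaker model and have far more dimensions; the scoring step itself stays the same.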

Taxonomy

Topics: Speech Recognition and Synthesis