Multi-Stage Face-Voice Association Learning with Keynote Speaker Diarization

Ruijie Tao; Zhan Shi; Yidi Jiang; Duc-Tuan Truong; Eng-Siong Chng; Massimo Alioto; Haizhou Li

arXiv:2407.17902·eess.AS·July 26, 2024·ACM Multimedia

TL;DR

This paper introduces MFV-KSD, a multi-stage framework for cross-modal face-voice association that effectively handles noisy inputs and improves inter-modal correlation, achieving top performance in a multilingual challenge.

Contribution

The paper proposes a novel three-stage training strategy and a keynote speaker diarization front-end for robust face-voice association learning.

Findings

01 Achieved first place in the FAME challenge with 19.9% EER.

02 Demonstrated robustness in multilingual environments.

03 Enhanced intra-modal and inter-modal feature learning.

Abstract

The human brain has the capability to associate the unknown person's voice and face by leveraging their general relationship, referred to as "cross-modal speaker verification". This task poses significant challenges due to the complex relationship between the modalities. In this paper, we propose a "Multi-stage Face-voice Association Learning with Keynote Speaker Diarization" (MFV-KSD) framework. MFV-KSD contains a keynote speaker diarization front-end to effectively address the noisy speech inputs issue. To balance and enhance the intra-modal feature learning and inter-modal correlation understanding, MFV-KSD utilizes a novel three-stage training strategy. Our experimental results demonstrated robust performance, achieving the first rank in the 2024 Face-voice Association in Multilingual Environments (FAME) challenge with an overall Equal Error Rate (EER) of 19.9%. Details can be…
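The abstract reports its headline result as an Equal Error Rate (EER): the operating point at which the false-acceptance rate equals the false-rejection rate. The sketch below shows one common way to estimate it from a list of face-voice similarity scores. It is illustrative only: the function name `equal_error_rate` and the toy scores are assumptions made for this example, not the challenge's scoring script or the authors' code, and official tooling may interpolate between thresholds differently.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER of a verification system.

    scores: similarity scores, higher means "more likely the same person".
    labels: 1 for a matching face-voice pair, 0 for a non-matching pair.
    Hypothetical helper for illustration; not taken from the paper.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one matching and one non-matching trial")

    best_gap, eer = None, None
    # Sweep every observed score as a candidate accept threshold.
    for t in sorted(set(scores)):
        far = sum(s >= t for s in neg) / len(neg)  # non-matching pairs accepted
        frr = sum(s < t for s in pos) / len(pos)   # matching pairs rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint at the threshold where the two rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (made-up numbers, not results from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))  # 0.25
```

On the toy trials the two error rates meet at a threshold of 0.6, where one of four non-matching pairs is accepted and one of four matching pairs is rejected. An EER of 19.9% therefore means that, at the balanced threshold, roughly one in five trials of each kind is misclassified.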

Code & Models

Repositories

taoruijie/mfv-ksd
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis