Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association

Wuyang Chen; Yanjie Sun; Kele Xu; Yong Dou

arXiv:2408.02025 · cs.SD · August 20, 2024

TL;DR

This paper presents a contrastive learning-based chaining-cluster method that improves face-voice association across multiple languages while handling variability and outliers in the data. The method placed 2nd in the FAME 2024 challenge.

Contribution

It introduces a novel supervised cross-contrastive learning approach combined with a chaining-cluster post-processing step for multilingual face-voice association.

Findings

01. Achieved 2nd place in the FAME 2024 challenge

02. Demonstrated robustness across multilingual data

03. Validated effectiveness through extensive experiments

Abstract

The innate correlation between a person's face and voice has recently emerged as a compelling area of study, especially within the context of multilingual environments. This paper introduces our novel solution to the Face-Voice Association in Multilingual Environments (FAME) 2024 challenge, focusing on a contrastive learning-based chaining-cluster method to enhance face-voice association. This task involves the challenges of building biometric relations between auditory and visual modality cues and modelling the prosody interdependence between different languages while addressing both intrinsic and extrinsic variability present in the data. To handle these non-trivial challenges, our method employs supervised cross-contrastive (SCC) learning to establish robust associations between voices and faces in multi-language scenarios. Following this, we have specifically designed a…
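The abstract names supervised cross-contrastive (SCC) learning but does not give the loss itself. As a rough illustration of the general idea only, the sketch below implements a generic supervised contrastive objective across two modalities: each voice embedding is pulled toward every face embedding that shares its identity label and pushed away from the rest, and the same is done in the face-to-voice direction. The function name `cross_modal_supcon_loss`, the `temperature` value, and the symmetric averaging are assumptions made for this example, not the authors' formulation.

```python
import numpy as np


def _log_softmax(logits):
    # Row-wise log-softmax, shifted by the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_modal_supcon_loss(voice, face, labels, temperature=0.1):
    """Generic supervised contrastive loss between voice and face embeddings.

    Hypothetical helper for illustration; not taken from the paper.
    voice, face: arrays of shape (N, d); row i of each belongs to labels[i].
    """
    # L2-normalise so the dot product is cosine similarity.
    voice = voice / np.linalg.norm(voice, axis=1, keepdims=True)
    face = face / np.linalg.norm(face, axis=1, keepdims=True)
    labels = np.asarray(labels)

    # pos[i, j] = 1 when voice i and face j share an identity label.
    pos = (labels[:, None] == labels[None, :]).astype(float)
    logits = voice @ face.T / temperature

    def one_direction(lg, mask):
        # Average log-probability assigned to the positives of each anchor.
        logp = _log_softmax(lg)
        return -((logp * mask).sum(axis=1) / mask.sum(axis=1)).mean()

    # Symmetric: voice-to-face and face-to-voice (an assumption of this sketch).
    return 0.5 * (one_direction(logits, pos) + one_direction(logits.T, pos.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = rng.normal(size=(8, 16))
    f = rng.normal(size=(8, 16))
    ids = [0, 0, 1, 1, 2, 2, 3, 3]
    print(cross_modal_supcon_loss(v, f, ids))
```

In this toy form, embeddings whose voice and face vectors agree within each identity give a lower loss than embeddings where the identities are swapped across modalities, which is the behaviour a cross-modal association objective is meant to reward.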

Code & Models

Repositories

colaudiolab/fame24_solution (official)

Taxonomy

Topics: Speech and dialogue systems · Emotion and Mood Recognition · Social Robot Interaction and HRI