Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association
Wuyang Chen, Yanjie Sun, Kele Xu, Yong Dou

TL;DR
This paper presents a contrastive learning-based chaining-cluster method that improves face-voice association across multiple languages. The method addresses variability and outliers in the data and placed 2nd in the FAME 2024 challenge.
Contribution
It introduces a novel supervised cross-contrastive learning approach combined with a chaining-cluster post-processing step for multilingual face-voice association.
Findings
Achieved 2nd place in FAME 2024 challenge
Demonstrated robustness across multilingual data
Validated effectiveness through extensive experiments
Abstract
The innate correlation between a person's face and voice has recently emerged as a compelling area of study, especially within the context of multilingual environments. This paper introduces our novel solution to the Face-Voice Association in Multilingual Environments (FAME) 2024 challenge, focusing on a contrastive learning-based chaining-cluster method to enhance face-voice association. This task involves the challenges of building biometric relations between auditory and visual modality cues and modelling the prosody interdependence between different languages while addressing both intrinsic and extrinsic variability present in the data. To handle these non-trivial challenges, our method employs supervised cross-contrastive (SCC) learning to establish robust associations between voices and faces in multi-language scenarios. Following this, we have specifically designed a…
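The abstract names supervised cross-contrastive (SCC) learning but this summary does not give its exact formulation. As a rough illustration only, the sketch below shows a generic supervised contrastive loss computed across two modalities: each face embedding is pulled toward voice embeddings with the same identity label and pushed away from the rest, and symmetrically for voices. The function name `cross_modal_supcon_loss`, the temperature value, and the toy data are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def _log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_modal_supcon_loss(face_emb, voice_emb, labels, temperature=0.1):
    """Generic supervised contrastive loss between paired face and voice batches.

    Hypothetical helper: row i of face_emb and voice_emb share identity labels[i].
    """
    # L2-normalise so the dot product is a cosine similarity.
    face = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    voice = voice_emb / np.linalg.norm(voice_emb, axis=1, keepdims=True)
    labels = np.asarray(labels)

    logits = face @ voice.T / temperature  # (N, N) face-to-voice similarities
    # pos[i, j] = 1 when face i and voice j belong to the same identity.
    pos = (labels[:, None] == labels[None, :]).astype(float)

    # Average negative log-likelihood of the positives, in both directions.
    f2v = -(pos * _log_softmax(logits)).sum(axis=1) / pos.sum(axis=1)
    v2f = -(pos.T * _log_softmax(logits.T)).sum(axis=1) / pos.T.sum(axis=1)
    return 0.5 * (f2v.mean() + v2f.mean())

# Toy data (assumed): two identities, two samples each, 2-D embeddings.
face = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
labels = [0, 0, 1, 1]

aligned = cross_modal_supcon_loss(face, face.copy(), labels)   # voices match faces
mismatched = cross_modal_supcon_loss(face, face[::-1], labels)  # voices swapped across identities
print(aligned, mismatched)
```

On this toy data the loss is lower when voice embeddings sit near the faces of the same identity than when they are swapped across identities, which is the behaviour a contrastive objective of this kind is meant to reward.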
Taxonomy
Topics: Speech and dialogue systems · Emotion and Mood Recognition · Social Robot Interaction and HRI
