Shared Multi-modal Embedding Space for Face-Voice Association

Christopher Simic; Korbinian Riedhammer; Tobias Bocklet

arXiv:2512.04814 · cs.SD · December 5, 2025

Open Access

TL;DR

This paper introduces a multi-modal embedding approach for face-voice association that handles multilingual and unseen-language scenarios and placed first in the FAME 2026 challenge.

Contribution

The paper presents separate uni-modal face and voice pipelines, complemented by age-gender features, whose outputs are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss for face-voice matching.
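
To make the idea of a shared embedding space concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each modality's general features are concatenated with its age-gender features and mapped by a single linear projection into a common space, where a face and a voice are compared by cosine similarity. The feature dimensions, the concatenation step, the random weights standing in for trained projection heads, and all function names are hypothetical.

```python
import math
import random


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def make_projection(in_dim, out_dim, rng):
    """Random linear map standing in for a trained projection head (assumption)."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def embed(general_feats, age_gender_feats, weights):
    """Concatenate both feature sets (assumption) and project into the shared space."""
    features = list(general_feats) + list(age_gender_feats)
    projected = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return l2_normalize(projected)


def cosine_score(a, b):
    """Cosine similarity of two unit-length embeddings."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
# Hypothetical sizes; real extractors produce much larger feature vectors.
FACE_DIM, VOICE_DIM, AG_DIM, SHARED_DIM = 8, 6, 3, 4

face_head = make_projection(FACE_DIM + AG_DIM, SHARED_DIM, rng)
voice_head = make_projection(VOICE_DIM + AG_DIM, SHARED_DIM, rng)

# Random placeholders for the outputs of the uni-modal pipelines.
face_emb = embed([rng.random() for _ in range(FACE_DIM)],
                 [rng.random() for _ in range(AG_DIM)], face_head)
voice_emb = embed([rng.random() for _ in range(VOICE_DIM)],
                  [rng.random() for _ in range(AG_DIM)], voice_head)

score = cosine_score(face_emb, voice_emb)  # higher means "more likely the same person"
```

Because both heads map into the same `SHARED_DIM`-dimensional space, a face embedding and a voice embedding can be scored directly even though the input features differ in size.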

Findings

01 Achieved first place in the FAME 2026 challenge.

02 Attained an average Equal-Error Rate (EER) of 23.99%.

03 Demonstrated effectiveness in multilingual and unseen-language settings.
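
For readers unfamiliar with the metric, the Equal-Error Rate is the operating point at which the false-acceptance rate equals the false-rejection rate. The sketch below is one simple way to estimate it from a list of match scores; it is an illustration under stated assumptions, not the challenge's official scoring script. The function name, the threshold sweep over observed scores, and the toy data are all hypothetical.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER from similarity scores.

    labels: 1 for a matching face-voice pair, 0 for a non-matching pair.
    A higher score is assumed to mean "more likely a match".
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one matching and one non-matching pair")

    # Sweep every observed score as a threshold, plus one above all scores.
    thresholds = sorted(set(scores)) + [float("inf")]
    best_gap, eer = float("inf"), None
    for threshold in thresholds:
        # False acceptance: non-matching pair scored at or above the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False rejection: matching pair scored below the threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy data: one matching pair scores below one non-matching pair.
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [1, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

An EER of 0 would mean matching and non-matching pairs are perfectly separated by some threshold, while 50% corresponds to chance-level scoring on balanced data.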

Abstract

The FAME 2026 challenge combines two demanding tasks: learning face-voice associations, and doing so in a multilingual setting that includes testing on languages the model was not trained on. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. The resulting uni-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.
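
The abstract does not spell out the loss, so the following is only a sketch of the general family of angular margin losses. It implements the common additive formulation with a fixed margin (ArcFace-style), which is swapped in here for illustration; how the paper's AAM loss adapts the margin is not stated above. The margin and scale values, the per-identity class weight vectors, and the function names are assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def angular_margin_loss(embedding, class_weights, target, margin=0.2, scale=30.0):
    """Cross-entropy over scaled cosines, with an angular margin on the target class.

    embedding:     one sample's embedding in the shared space
    class_weights: one weight vector per training identity (assumption)
    target:        index of the sample's true identity
    """
    emb = l2_normalize(embedding)
    logits = []
    for index, weight in enumerate(class_weights):
        cosine = sum(e * w for e, w in zip(emb, l2_normalize(weight)))
        cosine = max(-1.0, min(1.0, cosine))  # guard acos against rounding error
        if index == target:
            # Penalize the true class by widening its angle; cap at pi so the
            # adjusted cosine never increases again.
            theta = min(math.acos(cosine) + margin, math.pi)
            cosine = math.cos(theta)
        logits.append(scale * cosine)

    # Numerically stable negative log-softmax of the target logit.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[target]


# Toy example with two identities in a 2-D space.
toy_embedding = [1.0, 0.5]
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
loss_plain = angular_margin_loss(toy_embedding, toy_weights, target=0, margin=0.0, scale=10.0)
loss_margin = angular_margin_loss(toy_embedding, toy_weights, target=0, margin=0.2, scale=10.0)
```

With the margin enabled, the same sample incurs a larger loss, so training has to pull embeddings of the same identity closer in angle than plain softmax would require. That pressure is what makes the resulting space useful for face-voice matching by cosine similarity.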

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Speech Recognition and Synthesis