Towards Language-Independent Face-Voice Association with Multimodal Foundation Models

Aref Farhadipour; Teodora Vukovic; Volker Dellwo

arXiv:2512.02759·eess.AS·December 3, 2025

Open Access

TL;DR

This paper presents a multimodal foundation model approach for cross-lingual face-voice association, achieving strong generalization in unseen languages and securing second place in the FAME2026 Challenge.

Contribution

The paper combines a multimodal foundation model (ImageBind) with LoRA fine-tuning for multilingual face-voice verification, and curates an external Arabic dataset from VoxBlink to address the data scarcity and language constraints of the challenge.

Findings

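01 ImageBind-LoRA achieved an EER of 24.73% on the evaluation set (English and German).

02 Although fine-tuned exclusively on Arabic audio, the system generalized to the English and German evaluation languages.

03 The system secured 2nd place in the FAME2026 Challenge.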

Abstract

This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal verification under unique multilingual conditions, specifically unseen and unheard languages. Our approach investigates two distinct architectures, consisting of a baseline dual-encoder system trained from scratch using contrastive and orthogonal projection losses, and a foundation model approach leveraging ImageBind with LoRA. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates remarkable cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieved an EER of 24.73% on the evaluation set (English and German), securing 2nd place in the competition.
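
The headline metric above is the equal error rate (EER): the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how an EER can be computed from verification scores, not the challenge's official scoring script. The function name `equal_error_rate`, the threshold sweep, and the convention that label 1 marks a face-voice pair from the same identity are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER from similarity scores and binary labels.

    Assumption: labels use 1 for a matching face-voice pair and 0 for a
    non-matching pair, and a higher score means "more likely a match".
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both matching and non-matching trials")

    # Candidate thresholds: every distinct score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = np.append(np.unique(scores), np.inf)

    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        accept = scores >= t
        far = np.sum(accept & (labels == 0)) / n_neg   # false acceptance rate
        frr = np.sum(~accept & (labels == 1)) / n_pos  # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)
```

With a finite trial list the two error rates rarely cross exactly, so this sketch reports the midpoint at the threshold where they are closest. Under this definition, an EER of 24.73% means that at the balancing threshold roughly one in four matching pairs is rejected and roughly one in four non-matching pairs is accepted.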

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis