Towards Language-Independent Face-Voice Association with Multimodal Foundation Models
Aref Farhadipour, Teodora Vukovic, Volker Dellwo

TL;DR
This paper presents a multimodal foundation model approach for cross-lingual face-voice association, achieving strong generalization in unseen languages and securing second place in the FAME2026 Challenge.
Contribution
It combines a multimodal foundation model (ImageBind) with LoRA fine-tuning for multilingual face-voice verification, and curates an external Arabic dataset from VoxBlink to address the challenge's data scarcity and language constraints.
Findings
ImageBind-LoRA achieved an EER of 24.73% on the evaluation set (English and German)
Despite being fine-tuned exclusively on Arabic audio, the model generalized to the evaluation languages
Secured 2nd place in the FAME2026 Challenge
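The headline metric above is the equal error rate (EER): the operating point at which the false acceptance rate and the false rejection rate of a verification system coincide. The following is a minimal sketch of how an EER can be computed from trial scores, not the challenge's official scoring script. The function name `equal_error_rate`, the convention that higher scores mean "same identity", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER for similarity scores (higher = more likely a match).

    scores: list of floats, one per face-voice trial (hypothetical inputs).
    labels: list of truthy/falsy values, truthy when face and voice match.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one matching and one non-matching trial")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = sorted(set(scores))
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, eer = None, None
    for t in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        far = sum(s >= t for s in neg) / len(neg)  # false acceptance rate
        frr = sum(s < t for s in pos) / len(pos)   # false rejection rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of FAR and FRR where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy example with eight trials (made-up numbers, not challenge data).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```

With a finite set of trials the two error rates rarely cross exactly, so this sketch reports the midpoint at the threshold where they are closest; scoring tools may interpolate differently and give slightly different values.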
Abstract
This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal verification under unique multilingual conditions, specifically unseen and unheard languages. Our approach investigates two distinct architectures, consisting of a baseline dual-encoder system trained from scratch using contrastive and orthogonal projection losses, and a foundation model approach leveraging ImageBind with LoRA. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates remarkable cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieved an EER of 24.73% on the evaluation set (English and German), securing 2nd place in the competition.
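To make the LoRA component concrete, here is a minimal sketch of the general low-rank adaptation idea applied to a single linear layer: the pretrained weight stays frozen and only two small matrices are trained. It is not the authors' implementation and says nothing about which ImageBind layers were adapted. The class name `LoRALinear`, the rank, the scaling factor `alpha`, and the toy weights are all hypothetical choices for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """A frozen linear map W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # pretrained weight, kept frozen
        self.scale = alpha / rank     # common LoRA scaling convention
        # A starts as small random values and B as zeros, so the adapted
        # layer initially reproduces the pretrained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)                # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))   # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Count only the entries of A and B; W is not updated."""
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


# Toy 2x3 layer (made-up weights) adapted with rank 1.
layer = LoRALinear([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], rank=1, alpha=2.0)
print(layer.forward([1.0, 0.0, -1.0]))  # equals the frozen layer's output at init
print(layer.num_trainable())            # r * (d_in + d_out) trainable values
```

The point of the design is that the number of trainable values grows with r * (d_in + d_out) rather than d_in * d_out, which is one reason low-rank adapters are attractive when fine-tuning data is scarce.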
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis
