XM-ALIGN: Unified Cross-Modal Embedding Alignment for Face-Voice Association

Zhihua Fang; Shumei Tao; Junxu Wang; Liang He

arXiv:2512.06757·cs.SD·April 29, 2026

TL;DR

XM-ALIGN is a unified framework for cross-modal face-voice embedding alignment that improves verification performance through joint optimization and data augmentation, demonstrated on MAV-Celeb.

Contribution

The paper presents a novel unified cross-modal embedding alignment framework combining explicit and implicit mechanisms for face-voice association.

Findings

01 Superior performance on the MAV-Celeb dataset

02 Effective joint optimization of face and voice embeddings

03 Enhanced generalization with data augmentation

Abstract

This paper introduces our solution, XM-ALIGN (Unified Cross-Modal Embedding Alignment Framework), proposed for the FAME challenge at ICASSP 2026. Our framework combines explicit and implicit alignment mechanisms, significantly improving cross-modal verification performance in both "heard" and "unheard" languages. By extracting feature embeddings from both face and voice encoders and jointly optimizing them using a shared classifier, we employ mean squared error (MSE) as the embedding alignment loss to ensure tight alignment between modalities. Additionally, data augmentation strategies are applied during model training to enhance generalization. Experimental results show that our approach demonstrates superior performance on the MAV-Celeb dataset. The code will be released at https://github.com/PunkMale/XM-ALIGN.
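The joint objective described in the abstract can be illustrated with a minimal sketch. It is not taken from the XM-ALIGN repository. It assumes that the shared classifier is a single linear layer trained with softmax cross-entropy on both modalities, that the MSE term is computed directly between the face and voice embeddings of the same identity, and that the two parts are combined with a scalar weight. The names `joint_loss`, `shared_logits`, and `align_weight` are hypothetical, and plain Python lists stand in for encoder outputs.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def shared_logits(weights, emb):
    """Apply one linear classifier (one weight row per identity) to an embedding."""
    return [sum(w * e for w, e in zip(row, emb)) for row in weights]


def joint_loss(face_emb, voice_emb, label, weights, align_weight=1.0):
    """Classification through a shared classifier plus an MSE alignment term.

    Assumption: the two classification losses are summed and the MSE term is
    scaled by `align_weight`; the paper's exact weighting is not given here.
    """
    cls = (cross_entropy(shared_logits(weights, face_emb), label)
           + cross_entropy(shared_logits(weights, voice_emb), label))
    align = mse(face_emb, voice_emb)
    return cls + align_weight * align


# Toy usage: two identities, 2-dimensional embeddings.
W = [[1.0, 0.0], [0.0, 1.0]]
aligned = joint_loss([1.0, 0.0], [1.0, 0.0], label=0, weights=W)
misaligned = joint_loss([1.0, 0.0], [0.0, 1.0], label=0, weights=W)
print(aligned < misaligned)
```

In this toy setup the misaligned pair is penalized twice: the voice embedding is classified less confidently by the shared classifier, and the MSE term is nonzero. That is one way to read the abstract's combination of implicit and explicit alignment, though the mapping of those terms onto the two losses is an interpretation rather than something the abstract states.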

Code & Models

Repositories

PunkMale/XM-ALIGN
github
