Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder
Chong Peng, Liqiang He, Dan Su

TL;DR
This paper proposes a novel multimodal encoder framework that enhances voice-face association learning by leveraging implicit embedding information and an effective pair selection method, achieving state-of-the-art results in multiple tasks.
Contribution
It introduces a new unsupervised framework with a multimodal encoder and pair selection to improve voice-face association learning beyond traditional similarity measures.
Findings
Achieves state-of-the-art results in voice-face matching, verification, and retrieval.
Improves verification accuracy by approximately 3%.
Enhances matching and retrieval performance by about 2.5% and 1.3% respectively.
Abstract
Much progress has been made in learning the association between voice and face. However, most previous models rely on cosine similarity or L2 distance to evaluate the likeness of voices and faces after contrastive learning, and these scores are then applied to retrieval and matching tasks. This approach treats the embeddings only as high-dimensional vectors and uses only a small part of the available information. This paper introduces a novel framework in an unsupervised setting for learning voice-face associations. By employing a multimodal encoder after contrastive learning and framing the problem as binary classification, we can learn the implicit information within the embeddings in a more effective and varied manner. Furthermore, by introducing an effective pair selection method, we enhance the learning outcomes of both contrastive learning and the matching task.…
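The contrast the abstract draws, between scoring a pair by cosine similarity and scoring it with a learned binary classifier over the fused pair, can be illustrated with a minimal sketch. This is not the authors' implementation: the `FusionMatcher` class, its small MLP, the layer sizes, and the random untrained weights are all assumptions made for illustration, standing in for the paper's multimodal encoder, whose architecture is not described here.

```python
import numpy as np


def cosine_score(voice, face):
    """Baseline: cosine similarity between one voice and one face embedding."""
    v = voice / np.linalg.norm(voice)
    f = face / np.linalg.norm(face)
    return float(v @ f)


class FusionMatcher:
    """Hypothetical stand-in for a multimodal encoder.

    A tiny MLP over the concatenated (voice, face) pair that outputs
    P(same identity). Weights are random here; training is omitted.
    """

    def __init__(self, dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, size=(2 * dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, size=hidden)
        self.b2 = 0.0

    def match_probability(self, voice, face):
        x = np.concatenate([voice, face])            # fuse the two modalities
        h = np.maximum(0.0, x @ self.w1 + self.b1)   # ReLU hidden layer
        logit = h @ self.w2 + self.b2
        return float(1.0 / (1.0 + np.exp(-logit)))   # sigmoid -> probability


def binary_cross_entropy(p, label):
    """Loss for the match / no-match classification objective."""
    eps = 1e-12
    return float(-(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps)))


rng = np.random.default_rng(1)
voice_emb = rng.normal(size=8)   # placeholder embeddings, not real data
face_emb = rng.normal(size=8)

matcher = FusionMatcher(dim=8)
p_match = matcher.match_probability(voice_emb, face_emb)
print("cosine score:", cosine_score(voice_emb, face_emb))
print("fused match probability (untrained):", p_match)
print("BCE if the pair is a true match:", binary_cross_entropy(p_match, 1))
```

The cosine score is fixed by the angle between the two vectors, whereas the fused classifier can in principle learn interactions between any embedding dimensions once it is trained on matched and mismatched pairs.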
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Contrastive Learning
