Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder

Chong Peng; Liqiang He; Dan Su

arXiv:2404.09509 · cs.CV · April 16, 2024 · 1 citation

TL;DR

This paper proposes a novel multimodal encoder framework that enhances voice-face association learning by leveraging implicit embedding information and an effective pair selection method, achieving state-of-the-art results in multiple tasks.

Contribution

It introduces a new unsupervised framework with a multimodal encoder and pair selection to improve voice-face association learning beyond traditional similarity measures.

Findings

01

Achieves state-of-the-art results in voice-face matching, verification, and retrieval.

02

Improves verification accuracy by approximately 3%.

03

Enhances matching and retrieval performance by about 2.5% and 1.3% respectively.

Abstract

Learning the association between voices and faces has seen considerable progress. However, most previous models rely on cosine similarity or L2 distance to measure how alike a voice and a face are after contrastive learning, and then apply that score to retrieval and matching tasks. This approach treats the embeddings only as high-dimensional vectors and uses a small fraction of the information available in them. This paper introduces a novel framework for learning voice-face associations in an unsupervised setting. By employing a multimodal encoder after contrastive learning and casting the problem as binary classification, the framework learns the implicit information within the embeddings more effectively and in more varied ways. Furthermore, an effective pair selection method improves the learning outcomes of both contrastive learning and the matching task.…
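
As a rough illustration of the distinction the abstract draws, the sketch below contrasts a fixed cosine-similarity score with a small classifier trained to decide whether a face embedding and a voice embedding belong to the same person. Everything in it is an assumption made for illustration: the abstract does not specify the multimodal encoder, so a logistic layer over hand-built fusion features stands in for it, and the names `fuse`, `match_probability`, `train`, and the toy embeddings are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Baseline score: cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fuse(face, voice):
    """Hypothetical fusion: both embeddings plus element-wise interactions."""
    product = [f * v for f, v in zip(face, voice)]
    gap = [abs(f - v) for f, v in zip(face, voice)]
    return list(face) + list(voice) + product + gap

def match_probability(face, voice, weights, bias):
    """Binary classifier head (assumed logistic layer) over fused features."""
    logit = sum(w * x for w, x in zip(weights, fuse(face, voice))) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def train(pairs, dim, epochs=200, lr=0.1):
    """Fit the head with plain SGD on binary cross-entropy."""
    weights, bias = [0.0] * (4 * dim), 0.0
    for _ in range(epochs):
        for face, voice, label in pairs:
            # For a sigmoid output, d(BCE)/d(logit) = probability - label.
            grad = match_probability(face, voice, weights, bias) - label
            features = fuse(face, voice)
            weights = [w - lr * grad * x for w, x in zip(weights, features)]
            bias -= lr * grad
    return weights, bias

# Toy, made-up embeddings: label 1 = same identity, 0 = different identity.
FACE = [1.0, 0.0]
VOICE_SAME = [0.9, 0.1]
VOICE_OTHER = [0.0, 1.0]
PAIRS = [(FACE, VOICE_SAME, 1), (FACE, VOICE_OTHER, 0)]

if __name__ == "__main__":
    weights, bias = train(PAIRS, dim=2)
    print("cosine, same identity: ", cosine_similarity(FACE, VOICE_SAME))
    print("cosine, other identity:", cosine_similarity(FACE, VOICE_OTHER))
    print("P(match), same identity: ", match_probability(FACE, VOICE_SAME, weights, bias))
    print("P(match), other identity:", match_probability(FACE, VOICE_OTHER, weights, bias))
```

The cosine score is a fixed function of the two vectors, whereas the classifier's decision rule is learned from labeled pairs, which is the general idea behind treating matching as binary classification. A real system would presumably replace the logistic layer with a trained multimodal encoder and draw its pairs from the pair selection method, which this summary does not describe.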

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Contrastive Learning