Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association

Peisong Wen; Qianqian Xu; Yangbangyan Jiang; Zhiyong Yang; Yuan He; Qingming Huang

arXiv:2103.07293 · cs.CV · March 15, 2021

TL;DR

This paper introduces an adaptive framework for learning voice-face associations that combines global and local modality alignment with a dynamic reweighting scheme for subjects of differing learning difficulty, improving performance on voice-face matching, verification, and retrieval.
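Once voice and face embeddings live in a shared space, the evaluation tasks reduce to comparing embeddings. The sketch below illustrates 1:2 voice-to-face matching (pick which of two faces belongs to the voice) and a threshold-based verification check using cosine similarity. It is a minimal illustration under assumptions, not the paper's evaluation code: the function names, the use of cosine similarity, and the `threshold` parameter are all hypothetical choices, and embeddings are assumed to be non-zero vectors of equal length.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero, equal-length embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_1_to_2(voice, face_a, face_b) -> int:
    """1:2 matching: return 0 if face_a is closer to the voice, else 1."""
    return 0 if cosine(voice, face_a) >= cosine(voice, face_b) else 1

def matching_accuracy(trials: List[Tuple[list, list, list, int]]) -> float:
    """trials: (voice, face_a, face_b, index_of_true_face) tuples."""
    hits = sum(match_1_to_2(v, fa, fb) == y for v, fa, fb, y in trials)
    return hits / len(trials)

def verify(voice, face, threshold: float = 0.5) -> bool:
    """Verification: declare 'same identity' if similarity clears a threshold."""
    return cosine(voice, face) >= threshold
```

Retrieval follows the same pattern: rank a gallery of face embeddings by `cosine(voice, face)` and inspect the top results.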

Contribution

It proposes a novel two-level modality alignment loss that uses both global and local information, and a dynamic reweighting scheme that addresses the diversity of learning difficulty across subjects.
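To make the "two-level" idea concrete, here is a minimal sketch that combines a local, pair-level alignment term with a global, identity-level classification term. This is an illustration under stated assumptions, not the authors' loss: the local term is a swapped-in cross-modal InfoNCE (each voice should be closest to its own face within the batch, assuming one voice-face pair per distinct identity), the global term is a softmax identity classifier shared by both modalities, and `temperature`, `scale`, `lam`, and the fixed `prototypes` (which would be learnable in practice) are hypothetical. It is written in plain Python for readability; the linked official repository is a PyTorch project.

```python
import math
from typing import List

Vec = List[float]

def _dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))

def _normalize(v: Vec) -> Vec:
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v]

def _cross_entropy(logits: Vec, target: int) -> float:
    m = max(logits)  # subtract the max logit for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

def local_alignment_loss(voices: List[Vec], faces: List[Vec], temperature: float = 0.1) -> float:
    """Local level (assumed InfoNCE): voice i should pick face i out of the batch."""
    v = [_normalize(x) for x in voices]
    f = [_normalize(x) for x in faces]
    losses = [_cross_entropy([_dot(vi, fj) / temperature for fj in f], i)
              for i, vi in enumerate(v)]
    return sum(losses) / len(losses)

def global_identity_loss(voices: List[Vec], faces: List[Vec], labels: List[int],
                         prototypes: List[Vec], scale: float = 10.0) -> float:
    """Global level (assumed): classify both modalities against shared identity prototypes."""
    protos = [_normalize(p) for p in prototypes]
    losses = []
    for emb, y in zip(voices + faces, labels + labels):
        e = _normalize(emb)
        losses.append(_cross_entropy([scale * _dot(e, p) for p in protos], y))
    return sum(losses) / len(losses)

def two_level_loss(voices, faces, labels, prototypes, lam: float = 1.0) -> float:
    """Hypothetical combination: local term plus lam times the global term."""
    return local_alignment_loss(voices, faces) + lam * global_identity_loss(
        voices, faces, labels, prototypes)
```

Sharing one classifier across both modalities is the design choice that lets the global term pull voices and faces of the same identity toward a common class center, while the local term only compares pairs inside the current batch.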

Findings

01

Outperforms previous methods in voice-face matching, verification, and retrieval.

02

Incorporates global identity classification to enhance embedding separation.

03

Effectively filters unlearnable identities, improving learning efficiency.
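The filtering and reweighting described above can be pictured with a small per-identity weighting rule. The sketch below is a hypothetical heuristic, not the paper's scheme: it treats the highest-loss fraction of identities as unlearnable (weight 0) and gives the remaining identities softmax weights over their losses, so harder but still learnable identities count more. The names `dynamic_weights` and `weighted_loss` and the parameters `drop_fraction` and `temperature` are illustrative assumptions.

```python
import math
from typing import Dict

def dynamic_weights(identity_losses: Dict[str, float],
                    drop_fraction: float = 0.1,
                    temperature: float = 1.0) -> Dict[str, float]:
    """Hypothetical per-identity reweighting (not the paper's formula).

    1. The top `drop_fraction` of identities by current loss are treated as
       unlearnable and receive weight 0.
    2. Remaining identities receive softmax weights over their losses (harder
       ones weigh more), rescaled so the kept weights average to 1.
    """
    if not identity_losses:
        raise ValueError("identity_losses must not be empty")
    ranked = sorted(identity_losses, key=identity_losses.get, reverse=True)
    # Always keep at least one identity so the weights stay well defined.
    n_drop = min(int(len(ranked) * drop_fraction), len(ranked) - 1)
    dropped, kept = ranked[:n_drop], ranked[n_drop:]
    m = max(identity_losses[k] for k in kept)  # for numerical stability
    exps = {k: math.exp((identity_losses[k] - m) / temperature) for k in kept}
    total = sum(exps.values())
    weights = {k: 0.0 for k in dropped}
    weights.update({k: len(kept) * exps[k] / total for k in kept})
    return weights

def weighted_loss(identity_losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-identity losses; dropped identities contribute nothing."""
    norm = sum(weights.values())
    return sum(weights[k] * identity_losses[k] for k in identity_losses) / norm
```

In a training loop, the per-identity losses would be recomputed periodically (for example, once per epoch), so the set of filtered identities and the weights on the rest change as the model improves; that recomputation is what would make such a scheme "dynamic".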

Abstract

Nowadays, we have witnessed early progress in automatically learning the association between voice and face, which has brought a new wave of studies to the computer vision community. However, most prior work along this line (a) merely adopts local information to perform modality alignment and (b) ignores the diversity of learning difficulty across different subjects. In this paper, we propose a novel framework that jointly addresses both issues. Targeting (a), we propose a two-level modality alignment loss that considers both global and local information. Compared with existing methods, we introduce a global loss into the modality alignment process; this global component is driven by identity classification. Theoretically, we show that minimizing the loss could maximize the distance between embeddings across different identities while…

Code & Models

Repositories

KID-7391/seeking-the-shape-of-sound
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis