Learning Branched Fusion and Orthogonal Projection for Face-Voice Association

Muhammad Saad Saeed; Shah Nawaz; Muhammad Haris Khan; Sajid Javed; Muhammad Haroon Yousaf; Alessio Del Bue

arXiv:2208.10238·cs.CV·August 23, 2022·1 cites

TL;DR

This paper introduces a novel fusion and orthogonal projection (FOP) framework that enhances face-voice association by creating enriched, discriminative embeddings using a lightweight, two-stream network with orthogonality constraints, outperforming existing methods.

Contribution

The paper proposes a new FOP mechanism that effectively fuses face and voice features and enforces orthogonality, improving discriminative power and efficiency over prior metric learning approaches.
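To make the idea of an orthogonality constraint concrete, here is a minimal pure-Python sketch. It assumes the constraint is expressed on cosine similarities: embeddings of the same identity are pulled toward similarity 1, and embeddings of different identities toward similarity 0 (orthogonal). The function names and the exact form of the penalty are illustrative assumptions, not taken from the paper or the msaadsaeed/FOP repository.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonality_penalty(embeddings, labels):
    """Hypothetical penalty: (1 - mean same-identity similarity)
    + |mean different-identity similarity|.

    It is 0 when same-identity embeddings coincide in direction and
    different-identity embeddings are orthogonal on average.
    """
    unit = [l2_normalize(e) for e in embeddings]
    same, diff = [], []
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            sim = cosine(unit[i], unit[j])
            (same if labels[i] == labels[j] else diff).append(sim)
    # Assumed fallbacks when a batch has no same- or different-identity pairs.
    mean_same = sum(same) / len(same) if same else 1.0
    mean_diff = sum(diff) / len(diff) if diff else 0.0
    return (1.0 - mean_same) + abs(mean_diff)


# Two embeddings of identity 0 point the same way; identity 1 is orthogonal.
penalty = orthogonality_penalty([[1, 0], [2, 0], [0, 3]], [0, 0, 1])
```

Unlike a margin-based metric loss, a penalty of this shape needs neither a distance margin nor a negative-mining step, since every pair in the batch contributes through the two averages.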

Findings

01

FOP outperforms state-of-the-art on VoxCeleb1 and MAV-Celeb datasets.

02

The method is more efficient and effective than existing supervision techniques.

03

Cross-lingual analysis shows robustness of face-voice association across languages.

Abstract

Recent years have seen an increased interest in establishing association between faces and voices of celebrities leveraging audio-visual information from YouTube. Prior works adopt metric learning methods to learn an embedding space that is amenable for associated matching and verification tasks. Albeit showing some progress, such formulations are, however, restrictive due to dependency on distance-dependent margin parameter, poor run-time training complexity, and reliance on carefully crafted negative mining procedures. In this work, we hypothesize that an enriched representation coupled with an effective yet efficient supervision is important towards realizing a discriminative joint embedding space for face-voice association tasks. To this end, we propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings…

Code & Models

Repositories

msaadsaeed/FOP
PyTorch · Official

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis