Learning Branched Fusion and Orthogonal Projection for Face-Voice Association
Muhammad Saad Saeed, Shah Nawaz, Muhammad Haris Khan, Sajid Javed, Muhammad Haroon Yousaf, Alessio Del Bue

TL;DR
This paper introduces a fusion and orthogonal projection (FOP) framework for face-voice association. A lightweight two-stream network fuses face and voice features into enriched embeddings and trains them under orthogonality constraints to make them discriminative. The authors report that it outperforms existing methods.
Contribution
The paper proposes a new FOP mechanism that effectively fuses face and voice features and enforces orthogonality, improving discriminative power and efficiency over prior metric learning approaches.
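The summary above does not spell out the fusion rule or the loss, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a sigmoid-gated mix of the two embeddings and a cosine-similarity penalty that pulls same-identity embeddings together and pushes different-identity embeddings toward orthogonality. The names `gated_fusion`, `orthogonality_loss`, and `w_gate` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def gated_fusion(face, voice, w_gate):
    """Assumed fusion rule: a sigmoid gate mixes face and voice embeddings.

    face, voice: arrays of shape (n, d); w_gate: array of shape (2 * d, d).
    """
    logits = np.concatenate([face, voice], axis=1) @ w_gate
    gate = 1.0 / (1.0 + np.exp(-logits))
    return gate * face + (1.0 - gate) * voice


def orthogonality_loss(emb, labels):
    """Assumed orthogonality penalty on a batch of fused embeddings.

    s: mean cosine similarity over same-identity pairs (ideally 1).
    d: mean cosine similarity over different-identity pairs (ideally 0).
    Returns (1 - s) + |d|.
    """
    emb = l2_normalize(emb)
    sim = emb @ emb.T
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    pos = sim[same & off_diag]   # same identity, excluding self-pairs
    neg = sim[~same]             # different identities
    s = pos.mean() if pos.size else 1.0
    d = neg.mean() if neg.size else 0.0
    return (1.0 - s) + abs(d)
```

Unlike a triplet or contrastive loss, a penalty of this form needs no margin parameter and no negative mining, because every pair in the batch contributes through the two means.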
Findings
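FOP outperforms state-of-the-art methods on the VoxCeleb1 and MAV-Celeb datasets.
Its supervision is reported to be more efficient and effective than the metric-learning losses of prior work, which depend on a margin parameter and on negative mining.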
Cross-lingual analysis shows robustness of face-voice association across languages.
Abstract
Recent years have seen an increased interest in establishing associations between faces and voices of celebrities, leveraging audio-visual information from YouTube. Prior works adopt metric learning methods to learn an embedding space that is amenable to associated matching and verification tasks. Albeit showing some progress, such formulations are restrictive due to their dependency on a distance-dependent margin parameter, poor run-time training complexity, and reliance on carefully crafted negative mining procedures. In this work, we hypothesize that an enriched representation coupled with an effective yet efficient supervision is important towards realizing a discriminative joint embedding space for face-voice association tasks. To this end, we propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings…
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
