Face-Voice Association with Inductive Bias for Maximum Class Separation
Marta Moscati, Oleksandr Kats, Mubashir Noman, Muhammad Zaigham Zaheer, Yufang Hou, Markus Schedl, Shah Nawaz

TL;DR
This paper introduces a face-voice association method that imposes maximum class separation as an inductive bias, improving the discriminability of multimodal speaker representations and achieving state-of-the-art results on two task formulations.
Contribution
It is the first to apply maximum class separation as an inductive bias in face-voice association, enhancing discriminability of multimodal embeddings.
Findings
Achieves state-of-the-art performance on face-voice association tasks.
Imposing the inductive bias together with inter-class orthogonality losses improves results.
Demonstrates the effectiveness of maximum class separation in multimodal learning.
Abstract
Face-voice association is widely studied in multimodal learning and is approached by representing faces and voices with embeddings that are close for the same person and well separated from those of other people. Previous work achieved this with loss functions. Recent advances in classification have shown that the discriminative ability of embeddings can be strengthened by imposing maximum class separation as an inductive bias. This technique has never been used in face-voice association, and this work aims to fill that gap. More specifically, we develop a method for face-voice association that imposes maximum class separation among multimodal representations of different speakers as an inductive bias. Through quantitative experiments we demonstrate the effectiveness of our approach, showing that it achieves SOTA performance on two task formulations of face-voice association.…
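For intuition, maximum class separation can be pictured as fixing one unit-norm prototype per speaker so that every pair of prototypes is equally and maximally far apart. The NumPy sketch below is an illustration under assumptions, not the authors' implementation: it builds such prototypes by centering and normalizing one-hot vectors, which yields a pairwise cosine of -1/(C-1) for C classes. The function names `simplex_prototypes` and `prototype_logits` are hypothetical, and how the paper combines the prototypes with its face and voice encoders is not covered by this summary.

```python
import numpy as np


def simplex_prototypes(num_classes: int) -> np.ndarray:
    """Return num_classes unit vectors with pairwise cosine -1/(num_classes - 1).

    Assumption: a centered one-hot construction is used here for simplicity;
    it is one of several equivalent ways to build a regular simplex.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    # Subtract the mean of the one-hot vectors so the prototypes sum to zero.
    centered = np.eye(num_classes) - 1.0 / num_classes
    # Scale every row to unit length.
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)


def prototype_logits(embeddings: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Cosine similarity between L2-normalized embeddings and fixed prototypes."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ prototypes.T


if __name__ == "__main__":
    protos = simplex_prototypes(5)
    cosines = protos @ protos.T
    # Diagonal entries are 1; off-diagonal entries are -1/(5 - 1).
    print(np.round(cosines, 3))
```

Because the prototypes are fixed rather than learned, the separation between classes is built into the model before training instead of being encouraged only through a loss term.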
Taxonomy
Topics: Face recognition and analysis · Emotion and Mood Recognition · Speech and Audio Processing
