Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast
Boqing Zhu, Kele Xu, Changjian Wang, Zheng Qin, Tao Sun, Huaimin Wang, Yuxing Peng

TL;DR
This paper introduces CMPC, a novel unsupervised learning method that improves voice-face representation by addressing false negatives and weak correlations through semantic clustering and prototype comparison.
Contribution
The paper proposes cross-modal prototype contrastive learning (CMPC), enhancing unsupervised voice-face representation by leveraging semantic clustering and dynamic prototype comparison.
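One way to picture prototype comparison is the minimal sketch below. It is not the paper's implementation. The cluster assignments are assumed to be given (for example, from k-means over face embeddings), and the function names, the mean-based prototypes, and the temperature of 0.07 are all illustrative assumptions. Each voice embedding is contrasted against one face prototype per cluster instead of against every individual face.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cluster_prototypes(embeddings, assignments, num_clusters):
    """Mean of the normalized embeddings in each cluster, re-normalized.

    Assumes every cluster has at least one member and that members
    do not cancel to a zero vector.
    """
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    for emb, k in zip(embeddings, assignments):
        for d, x in enumerate(l2_normalize(emb)):
            sums[k][d] += x
    return [l2_normalize(s) for s in sums]

def prototype_contrast_loss(voices, face_prototypes, assignments, temperature=0.07):
    """Average softmax cross-entropy of each voice against all face prototypes.

    The positive for a voice is the prototype of its own cluster; the
    temperature value is a hypothetical default, not taken from the paper.
    """
    total = 0.0
    for v, k in zip(voices, assignments):
        v = l2_normalize(v)
        logits = [dot(v, p) / temperature for p in face_prototypes]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[k]
    return total / len(voices)

# Toy data: videos 0 and 1 are assumed to fall in the same cluster.
faces = [[1.0, 0.1], [1.0, -0.1], [0.0, 1.0]]
voices = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
assignments = [0, 0, 1]

prototypes = cluster_prototypes(faces, assignments, num_clusters=2)
loss = prototype_contrast_loss(voices, prototypes, assignments)
print(prototypes, loss)
```

In this toy setup the two faces in cluster 0 are merged into a single prototype, so they never act as negatives for each other's voices.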
Findings
Outperforms state-of-the-art unsupervised methods in voice-face association tasks.
Shows significant improvements in low-shot supervision scenarios.
Effectively resists false negatives and deviant positives in contrastive learning.
Abstract
We present an approach to learn voice-face representations from talking face videos, without any identity labels. Previous works employ cross-modal instance discrimination tasks to establish the correlation between voice and face. These methods neglect the semantic content of different videos, introducing false-negative pairs as training noise. Furthermore, the positive pairs are constructed based on the natural correlation between audio clips and visual frames. However, this correlation might be weak or inaccurate in a large amount of real-world data, which introduces deviant positives into the contrastive paradigm. To address these issues, we propose cross-modal prototype contrastive learning (CMPC), which takes advantage of contrastive methods and resists the adverse effects of false negatives and deviant positives. On one hand, CMPC could learn the intra-class invariance by…
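To make the false-negative problem concrete, here is a minimal sketch of a cross-modal instance discrimination loss in InfoNCE form. This is an assumed, commonly used formulation rather than one quoted from the paper, and the function names, toy embeddings, and temperature of 0.07 are hypothetical. Each voice treats the face from its own video as the positive and the faces from all other videos as negatives, even when another video shows the same speaker.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def instance_discrimination_loss(voices, faces, temperature=0.07):
    """Average voice-to-face InfoNCE loss; index i in both lists is video i."""
    voices = [l2_normalize(v) for v in voices]
    faces = [l2_normalize(f) for f in faces]
    total = 0.0
    for i, v in enumerate(voices):
        logits = [dot(v, f) / temperature for f in faces]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(voices)

# Batch A: two videos of two clearly different speakers.
clean_loss = instance_discrimination_loss(
    voices=[[1.0, 0.0], [0.0, 1.0]],
    faces=[[1.0, 0.0], [0.0, 1.0]],
)

# Batch B: videos 0 and 1 are assumed to show the same speaker, so
# their embeddings coincide, yet each is a negative for the other.
noisy_loss = instance_discrimination_loss(
    voices=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    faces=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
)

print(clean_loss, noisy_loss)
```

In batch B the loss cannot approach zero even though the embeddings are ideal for identity matching, because the objective penalizes the model for matching a voice to the same speaker's face from a different video.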
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Nasal Surgery and Airway Studies
Methods: Contrastive Learning
