On Learning Associations of Faces and Voices
Changil Kim, Hijung Valentina Shin, Tae-Hyun Oh, Alexandre Kaspar, Mohamed Elgharib, Wojciech Matusik

TL;DR
This study investigates how faces and voices are associated in humans and machines, demonstrating that learned cross-modal representations can match faces to voices with human-like accuracy and reveal demographic information.
Contribution
The paper introduces a new dataset and computational model that captures overlapping face-voice information, enabling accurate cross-modal identification and demographic correlation analysis.
Findings
Humans can associate unseen faces and voices above chance levels.
The model achieves face-voice matching performance comparable to humans.
The learned representations correlate with demographic attributes.
Abstract
In this paper, we study the associations between human faces and voices. Audiovisual integration, specifically the integration of facial and vocal information, is a well-researched area in neuroscience. It has been shown that the overlapping information between the two modalities plays a significant role in perceptual tasks such as speaker identification. Through an online study on a new dataset we created, we confirm previous findings that people can associate unseen faces with corresponding voices and vice versa with greater than chance accuracy. We computationally model the overlapping information between faces and voices and show that the learned cross-modal representation contains enough information to identify matching faces and voices with performance similar to that of humans. Our representation exhibits correlations to certain demographic attributes and features obtained from either…
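As a rough illustration of what "identifying matching faces and voices" from a shared representation could look like, the sketch below picks, for one voice embedding, the most similar face embedding among a set of candidates. Everything in it is an assumption made for illustration: the function names, the toy 4-dimensional vectors, and the use of cosine similarity as the matching score are hypothetical and are not taken from the paper, which may use a different architecture, distance, and evaluation protocol.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_voice_to_face(voice_emb, face_embs):
    """Return the index of the candidate face most similar to the voice."""
    scores = [cosine_similarity(voice_emb, face) for face in face_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (hypothetical): in a learned cross-modal space, a face and a
# voice of the same person are assumed to land close to each other.
voice = [1.05, -0.02, 0.51, -0.47]         # voice of person A
face_of_b = [-0.48, 1.03, 0.02, 0.52]      # distractor face (person B)
face_of_a = [0.97, 0.04, 0.46, -0.53]      # matching face (person A)

candidates = [face_of_b, face_of_a]
choice = match_voice_to_face(voice, candidates)
print("chosen candidate index:", choice)
```

With two candidates, this is a forced choice between one matching and one non-matching face; accuracy over many such trials can then be compared against the 50% chance level.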
Taxonomy
Topics: Speech and Audio Processing · Multisensory perception and integration · Face recognition and analysis
