TL;DR
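This paper introduces a self-supervised, cross-modal joint embedding of face and voice that is sensitive to person identity. The embedding supports voice-to-face and face-to-voice retrieval for identities unseen and unheard during training. The paper also establishes a benchmark for this task and applies the embedding to character labeling in TV dramas.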
Contribution
It presents a self-supervised method for learning a joint face-voice embedding and a curriculum learning schedule for hard negative mining, demonstrates cross-modal retrieval for identities unseen and unheard during training, and establishes a benchmark for this task.
Findings
A joint face-voice embedding can be learnt from videos of talking faces without identity labels
Cross-modal retrieval works for identities unseen and unheard during training
The embedding can be used to retrieve and label characters in TV dramas
Abstract
We propose and investigate an identity sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task, that is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios and establish a benchmark for this novel task; finally, we show an application of using the joint embedding for automatically retrieving and labelling characters in TV dramas.
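The abstract does not spell out the loss or the exact curriculum, so the following is only a minimal sketch of how a hard-negative curriculum for a face-voice embedding might look, under stated assumptions. It assumes a standard contrastive loss on Euclidean distances between L2-normalised embeddings, treats the face and voice from the same video as the positive pair, and treats voices from other videos in the batch as candidate negatives. The names `select_negatives` and `contrastive_loss`, the difficulty parameter `tau`, and the `margin` value are all hypothetical and chosen for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def select_negatives(face, voice, tau):
    """Pick one negative voice per face (assumed scheme, not the paper's exact one).

    face, voice: (B, D) arrays; row i of each comes from the same video.
    tau in [0, 1]: 0 picks the easiest (farthest) negative,
                   1 picks the hardest (closest) negative.
    Requires B >= 2.
    """
    f, v = l2_normalize(face), l2_normalize(voice)
    # Pairwise Euclidean distances, shape (B, B); the diagonal holds positives.
    dist = np.linalg.norm(f[:, None, :] - v[None, :, :], axis=2)
    batch = dist.shape[0]
    rank = int(round(tau * (batch - 2)))
    chosen = np.empty(batch, dtype=int)
    for i in range(batch):
        candidates = np.array([j for j in range(batch) if j != i])
        # Sort candidates from easiest (largest distance) to hardest.
        order = candidates[np.argsort(-dist[i, candidates])]
        chosen[i] = order[rank]
    return chosen, dist


def contrastive_loss(dist, negatives, margin=0.6):
    """Contrastive loss: pull positives together, push negatives past the margin."""
    pos = np.diag(dist)
    neg = dist[np.arange(len(negatives)), negatives]
    return float(np.mean(pos ** 2 + np.maximum(margin - neg, 0.0) ** 2))


# Usage with random stand-in embeddings; a real pipeline would use network outputs.
rng = np.random.default_rng(0)
face_emb = rng.normal(size=(8, 16))
voice_emb = rng.normal(size=(8, 16))
for tau in (0.0, 0.5, 1.0):  # a curriculum would raise tau as training proceeds
    negs, d = select_negatives(face_emb, voice_emb, tau)
    print(tau, contrastive_loss(d, negs))
```

In this sketch, raising `tau` over the course of training moves the selected negatives from easy to hard, which is one plausible reading of a "curriculum learning schedule for hard negative mining"; the schedule actually used by the authors may differ.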
