Learnable PINs: Cross-Modal Embeddings for Person Identity

Arsha Nagrani; Samuel Albanie; Andrew Zisserman

arXiv:1805.00833·cs.CV·July 27, 2018

TL;DR

This paper introduces a joint face-voice embedding for person identity, learned by cross-modal self-supervision without identity labels. The embedding supports voice-to-face and face-to-voice retrieval for identities unseen during training, and comes with a new benchmark and an application to labeling characters in TV dramas.

Contribution

The paper presents a self-supervised method for learning a joint face-voice embedding, develops a curriculum learning schedule for hard negative mining, demonstrates cross-modal retrieval for unseen and unheard identities, and establishes a benchmark for this task.
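
To make the idea of a curriculum for hard negative mining more concrete, the sketch below picks a negative sample at a hardness level `tau` that grows over training. It is an illustration under assumptions, not the authors' implementation: the function names `select_negative` and `tau_schedule`, the linear schedule, and every default value are hypothetical and are not taken from the paper.

```python
def select_negative(distances, tau):
    """Pick the index of a negative at hardness level tau in [0, 1].

    distances: distance from the anchor (e.g. a face embedding) to each
    candidate negative (e.g. voice embeddings of other clips).
    tau = 0 picks the easiest (farthest) negative, tau = 1 the hardest
    (closest). This selection rule is an assumption for illustration.
    """
    if not distances:
        raise ValueError("need at least one candidate negative")
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    # Order candidates from easiest (largest distance) to hardest.
    order = sorted(range(len(distances)), key=lambda i: distances[i], reverse=True)
    position = round(tau * (len(order) - 1))
    return order[position]


def tau_schedule(epoch, start=0.3, step=0.1, every=2, max_tau=1.0):
    """Hypothetical schedule: raise tau by `step` every `every` epochs."""
    return min(max_tau, start + step * (epoch // every))


# Toy distances from one anchor to five candidate negatives.
toy_distances = [0.2, 0.9, 0.5, 0.7, 0.1]
early_pick = select_negative(toy_distances, tau_schedule(0))
late_pick = select_negative(toy_distances, tau_schedule(100))
```

Early in training the selected negative is far from the anchor and easy to separate; as `tau` approaches 1 the selection moves toward the closest, most confusable negative.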

Findings

01 Successful cross-modal retrieval without identity labels

02 Effective retrieval of unseen and unheard identities

03 Application to character labeling in TV dramas

Abstract

We propose and investigate an identity sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task, that is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios and establish a benchmark for this novel task; finally, we show an application of using the joint embedding for automatically retrieving and labelling characters in TV dramas.
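
As a rough illustration of what cross-modal retrieval with a joint embedding looks like in practice, the sketch below ranks a gallery of face embeddings against a voice query. It assumes both modalities have already been mapped into the same space by some trained encoders, which are not shown. The function names, the toy 3-dimensional vectors, and the choice of cosine similarity are assumptions for illustration; for L2-normalised vectors, ranking by Euclidean distance gives the same order.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query, gallery, top_k=1):
    """Return indices of the top_k gallery items most similar to the query."""
    scores = [cosine_similarity(query, item) for item in gallery]
    ranked = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hypothetical embeddings: one voice query, three candidate faces.
voice_query = [0.9, 0.1, 0.0]
face_gallery = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.1],
    [0.0, 0.0, 1.0],
]
top_faces = retrieve(voice_query, face_gallery, top_k=2)
print(top_faces)  # → [1, 0]
```

Swapping the roles of the two lists gives face-to-voice retrieval, since the same similarity is used in both directions.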

Code & Models

Repositories

my-yy/learnable_pins (PyTorch)