GestSync: Determining who is speaking without a talking head

Sindhu B Hegde; Andrew Zisserman

arXiv:2310.05304 · cs.CV · October 10, 2023

TL;DR

Gesture-Sync is a new task: determining whether a person's gestures are correlated with their speech. The paper addresses it with a dual-encoder model trained by self-supervised learning alone, and applies it to audio-visual synchronisation and to identifying the speaker in a crowd.

Contribution

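The paper introduces the Gesture-Sync task, proposes a dual-encoder model for it, compares RGB frames, keypoint images, and keypoint vectors as input representations, and shows that the model can be trained with self-supervised learning alone, with evaluation on the LRS3 dataset.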

Findings

01 The model detects whether a person's gestures are correlated with their speech

02 The model can be trained using self-supervised learning alone

03 Applications include audio-visual synchronisation and determining who is speaking in a crowd without seeing faces

Abstract

In this paper we introduce a new synchronisation task, Gesture-Sync: determining if a person's gestures are correlated with their speech or not. In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement than there is between voice and lip motion. We introduce a dual-encoder model for this task, and compare a number of input representations including RGB frames, keypoint images, and keypoint vectors, assessing their performance and advantages. We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset. Finally, we demonstrate applications of Gesture-Sync for audio-visual synchronisation, and in determining who is the speaker in a crowd, without seeing their faces. The code, datasets and pre-trained models can be found at:…
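The two applications named in the abstract can be illustrated with a minimal sketch. It assumes the two encoders have already produced one embedding per time step for the audio and for each person's body motion; the encoders themselves are not shown, and the function names, the cosine-similarity score, and the toy vectors are hypothetical choices for illustration, not taken from the paper or its repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sync_score(visual, audio, offset=0):
    """Mean similarity between visual[t] and audio[t + offset] over the overlapping steps."""
    sims = [cosine(visual[t], audio[t + offset])
            for t in range(len(visual))
            if 0 <= t + offset < len(audio)]
    return sum(sims) / len(sims) if sims else float("-inf")

def best_offset(visual, audio, max_shift):
    """Audio-visual synchronisation: the shift (in steps) that maximises the sync score."""
    return max(range(-max_shift, max_shift + 1),
               key=lambda k: sync_score(visual, audio, k))

def pick_speaker(visual_by_person, audio):
    """Speaker selection: the person whose gesture embeddings best match the audio."""
    return max(visual_by_person,
               key=lambda name: sync_score(visual_by_person[name], audio))

# Toy embeddings standing in for encoder outputs (assumed, not real data).
audio = [[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0], [0, 1]]
shifted_visual = audio[1:]             # visual stream running one step ahead of the audio
people = {
    "A": audio,                        # gestures aligned with the audio
    "B": [[0, 1]] * len(audio),        # motion unrelated to the audio
}

print(best_offset(shifted_visual, audio, max_shift=2))  # 1
print(pick_speaker(people, audio))                      # A
```

In this sketch both applications reduce to the same operation: scoring how well a visual embedding sequence matches the audio, then taking the maximum over either time shifts or candidate people.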

Code & Models

Repositories

Sindhu-Hegde/gestsync (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Hand Gesture Recognition Systems · Face Recognition and Analysis