Leveraging Speech for Gesture Detection in Multimodal Communication

Esam Ghaleb; Ilya Burenko; Marlou Rasenberg; Wim Pouw; Ivan Toni; Peter Uhrig; Anna Wilson; Judith Holler; Aslı Özyürek; Raquel Fernández

arXiv:2404.14952 · cs.CV · April 24, 2024 · 1 citation

TL;DR

This paper presents a multimodal approach to co-speech gesture detection that integrates speech and visual data with Transformer models, both to improve detection accuracy and to study gesture-speech synchrony.

Contribution

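The paper introduces a multimodal framework that uses Transformer encoders for co-speech gesture detection and addresses two alignment problems: the temporal misalignment between gesture and speech onsets, and the different sampling rates of the two modalities.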

Findings

01  Combining speech and visual data improves gesture detection accuracy.

02  Expanding speech buffers enhances detection performance.

03  Cross-modal and early fusion outperform unimodal and late fusion methods.
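
For readers unfamiliar with the fusion terms in finding 03, the toy sketch below contrasts early and late fusion as they are generally understood. It is not the paper's implementation: the function names, the plain-list feature format, and the equal weighting are all assumptions made for illustration. Cross-modal fusion, where one modality attends to the other, is left out because the summary above does not describe it in enough detail.

```python
def early_fusion(speech_feats, visual_feats):
    """Early fusion (generic form): join the per-step feature vectors of both
    modalities so that a single model sees them together.
    Hypothetical helper; inputs are lists of per-step feature lists."""
    if len(speech_feats) != len(visual_feats):
        raise ValueError("modalities must be aligned to the same number of steps")
    return [s + v for s, v in zip(speech_feats, visual_feats)]


def late_fusion(speech_probs, visual_probs, weight=0.5):
    """Late fusion (generic form): each modality has its own model, and only
    their per-step gesture probabilities are combined.
    `weight` is an assumed mixing factor for the speech model."""
    if len(speech_probs) != len(visual_probs):
        raise ValueError("modalities must be aligned to the same number of steps")
    return [weight * s + (1 - weight) * v
            for s, v in zip(speech_probs, visual_probs)]


# Two time steps, 2-d speech features and 1-d visual features (made-up values).
fused = early_fusion([[0.1, 0.2], [0.3, 0.4]], [[1.0], [2.0]])
# fused == [[0.1, 0.2, 1.0], [0.3, 0.4, 2.0]]
```

The practical difference is where the modalities meet: in early fusion the model can learn interactions between speech and vision features, while in late fusion the two models never see each other's input.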

Abstract

Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting a gesture's beginning and end. Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech. This work addresses this gap by focusing on co-speech gesture detection, emphasising the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and differences in sampling rate between modalities. We investigate extended speech time windows and employ separate…
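
Two of the challenges named in the abstract, differing sampling rates and extended speech time windows, come down to index arithmetic. The sketch below maps a window of video frames to the audio samples that cover it, widened by a buffer on each side. The function name and every number used (30 fps video, 16 kHz audio, a 0.5 s buffer) are hypothetical choices for illustration and are not taken from the paper.

```python
def speech_window(start_frame, num_frames, fps, sample_rate, buffer_s=0.0):
    """Return (first_sample, end_sample) of the audio span that covers a
    visual window, extended by `buffer_s` seconds on both sides.
    `end_sample` is exclusive. The start is clamped to the recording start;
    the end is not clamped because the audio length is unknown here."""
    start_s = start_frame / fps                   # window start in seconds
    end_s = (start_frame + num_frames) / fps      # window end in seconds
    lo = max(0.0, start_s - buffer_s)
    hi = end_s + buffer_s
    return round(lo * sample_rate), round(hi * sample_rate)


# A 15-frame visual window starting at frame 30 (assumed 30 fps, 16 kHz audio).
tight = speech_window(30, 15, fps=30, sample_rate=16000)
# tight == (16000, 24000): 0.5 s of audio, exactly matching the frames
wide = speech_window(30, 15, fps=30, sample_rate=16000, buffer_s=0.5)
# wide == (8000, 32000): 1.5 s of audio around the same frames
```

A wider audio span gives a model access to speech that starts before or ends after the visible movement, which is one way to accommodate gesture and speech onsets that do not coincide.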

Code & Models

Repositories

esamghaleb/bimodal-co-speech-gesture-detection
PyTorch · Official

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Speech and dialogue systems

Methods: Attention Is All You Need · Sparse Evolutionary Training · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Dropout · Dense Connections · Label Smoothing · Residual Connection · Softmax