Leveraging Speech for Gesture Detection in Multimodal Communication
Esam Ghaleb, Ilya Burenko, Marlou Rasenberg, Wim Pouw, Ivan Toni, Peter Uhrig, Anna Wilson, Judith Holler, Aslı Özyürek, Raquel Fernández

TL;DR
This paper presents a multimodal approach for co-speech gesture detection, integrating speech and visual data with Transformer models to improve accuracy and understand gesture-speech synchrony.
Contribution
It introduces a novel multimodal framework using Transformer encoders for co-speech gesture detection, addressing temporal misalignment and sampling rate issues.
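The sampling-rate and misalignment issues can be made concrete with a small sketch. Assuming, purely for illustration, speech features at 50 Hz and video at 25 fps (neither rate comes from the paper), the hypothetical helper below picks the speech frames that cover a visual window and optionally widens that span by a buffer on each side, so that speech occurring slightly before or after the gesture is still visible to the model.

```python
import numpy as np

def speech_window_for_visual(speech_feats, speech_rate_hz, video_fps,
                             start_frame, n_frames, buffer_s=0.0):
    """Return the speech frames covering a visual window.

    The window is widened by `buffer_s` seconds on each side and clipped
    to the bounds of the recording. Hypothetical helper, not the paper's code.
    """
    # Convert the visual window from frame indices to seconds.
    t_start = start_frame / video_fps - buffer_s
    t_end = (start_frame + n_frames) / video_fps + buffer_s
    # Convert seconds to speech-frame indices and clip to valid range.
    lo = max(0, int(round(t_start * speech_rate_hz)))
    hi = min(len(speech_feats), int(round(t_end * speech_rate_hz)))
    return speech_feats[lo:hi]

# Toy data: 10 s of speech features at an assumed 50 Hz, 8 dims per frame.
speech = np.zeros((500, 8))

# A 1 s visual window (25 frames at an assumed 25 fps) starting at t = 2 s.
tight = speech_window_for_visual(speech, 50, 25, start_frame=50, n_frames=25)
wide = speech_window_for_visual(speech, 50, 25, start_frame=50, n_frames=25,
                                buffer_s=0.5)
print(tight.shape, wide.shape)
```

With these assumed rates, the unbuffered window spans 50 speech frames and the 0.5 s buffer doubles it to 100, so the two modalities arrive as sequences of different lengths and need separate handling before fusion.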
Findings
Combining speech and visual data improves gesture detection accuracy.
Extending the speech time window (buffer) around each visual window improves detection performance.
Cross-modal and early fusion outperform unimodal and late fusion methods.
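The fusion strategies named above differ in where the two modalities meet. The following is a minimal NumPy sketch under simplifying assumptions: random toy features, mean pooling, and linear classifiers stand in for the Transformer encoders, and all names and dimensions are hypothetical rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy inputs: one visual window and a longer run of speech frames.
visual = rng.normal(size=(25, 16))    # (T_v, d)
speech = rng.normal(size=(100, 16))   # (T_s, d)

def late_fusion(visual, speech, w_v, w_s):
    """Score each modality on its own, then average the probabilities."""
    p_v = sigmoid(visual.mean(axis=0) @ w_v)
    p_s = sigmoid(speech.mean(axis=0) @ w_s)
    return 0.5 * (p_v + p_s)

def early_fusion(visual, speech, w):
    """Join the features first, then score the joint representation once."""
    joint = np.concatenate([visual.mean(axis=0), speech.mean(axis=0)])
    return sigmoid(joint @ w)

def cross_modal_attention(visual, speech):
    """Let each visual frame attend over all speech frames (one head)."""
    scores = visual @ speech.T / np.sqrt(visual.shape[1])  # (T_v, T_s)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ speech                                # (T_v, d)

w_v, w_s = rng.normal(size=16), rng.normal(size=16)
w_joint = rng.normal(size=32)

p_late = late_fusion(visual, speech, w_v, w_s)
p_early = early_fusion(visual, speech, w_joint)
attended = cross_modal_attention(visual, speech)
print(attended.shape)
```

In late fusion the classifiers never see the other modality, whereas early and cross-modal fusion let the model relate speech and visual evidence before a decision is made, which is one plausible reading of why they perform better here.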
Abstract
Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting a gesture's beginning and end. Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech. This work addresses this gap by focusing on co-speech gesture detection, emphasising the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and differences in sampling rate between modalities. We investigate extended speech time windows and employ separate…
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Speech and dialogue systems
Methods: Attention Is All You Need · Sparse Evolutionary Training · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Dropout · Dense Connections · Label Smoothing · Residual Connection · Softmax
