Co-Speech Gesture Detection through Multi-Phase Sequence Labeling
Esam Ghaleb, Ilya Burenko, Marlou Rasenberg, Wim Pouw, Peter Uhrig, Judith Holler, Ivan Toni, Aslı Özyürek, Raquel Fernández

TL;DR
This paper presents a multi-phase sequence labeling framework for co-speech gesture detection. It combines Transformer encoders with Conditional Random Fields (CRFs) to capture the gesture dynamics that traditional binary classification misses.
Contribution
It introduces a new approach that models gesture phases as a sequence labeling problem, improving detection accuracy over existing binary classification methods.
Findings
Significant performance improvement over baseline models.
Transformer encoders enhance contextual understanding of gesture sequences.
Effective detection of gesture stroke phases in face-to-face dialogues.
Abstract
Gestures are integral components of face-to-face communication. They unfold over time, often following predictable movement phases of preparation, stroke, and retraction. Yet, the prevalent approach to automatic gesture detection treats the problem as binary classification, classifying a segment as either containing a gesture or not, thus failing to capture its inherently sequential and contextual nature. To address this, we introduce a novel framework that reframes the task as a multi-phase sequence labeling problem rather than binary classification. Our model processes sequences of skeletal movements over time windows, uses Transformer encoders to learn contextual embeddings, and leverages Conditional Random Fields to perform sequence labeling. We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues. The results consistently…
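To make the sequence labeling step concrete, here is a minimal sketch of Viterbi decoding for a linear-chain CRF over gesture phase labels. It is an illustration under stated assumptions, not the authors' implementation: the "neutral" label, the function name viterbi_decode, and all scores are hypothetical. In the framework described above, the per-frame emission scores would come from the Transformer encoder; here they are made-up numbers.

```python
from typing import List, Sequence

# "preparation", "stroke", and "retraction" are the phases named in the abstract.
# "neutral" (no gesture) is an assumed extra label for this sketch.
LABELS = ["neutral", "preparation", "stroke", "retraction"]


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring label index sequence for a linear-chain CRF.

    emissions[t][j]   : score of label j at time step t (hypothetical values here)
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label for a path that ends in label j at step t.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label to recover the path.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Assumed transition scores: jumping from neutral straight to stroke is penalized.
transitions = [[0.0] * 4 for _ in range(4)]
transitions[0][2] = -10.0

# Made-up emissions for three time steps; step 1 slightly prefers "stroke".
emissions = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.5, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]

decoded = [LABELS[i] for i in viterbi_decode(emissions, transitions)]
print(decoded)  # -> ['neutral', 'preparation', 'stroke']
```

A per-frame argmax over these emissions would label the second step as stroke. The transition penalty makes the decoder choose preparation instead, which is the kind of contextual, sequential constraint that independent binary classification of segments cannot express.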
Videos
Co-Speech Gesture Detection Through Multi-Phase Sequence Labeling · YouTube
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Softmax · Dense Connections
