Co-Speech Gesture Detection through Multi-Phase Sequence Labeling

Esam Ghaleb; Ilya Burenko; Marlou Rasenberg; Wim Pouw; Peter Uhrig; Judith Holler; Ivan Toni; Aslı Özyürek; Raquel Fernández

arXiv:2308.10680 · cs.CV · April 30, 2024

TL;DR

This paper presents a multi-phase sequence labeling framework for co-speech gesture detection. It combines Transformer encoders with Conditional Random Fields (CRFs) to capture the temporal structure of gestures, which binary gesture/no-gesture classification ignores.

Contribution

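The paper recasts gesture detection as a sequence labeling problem over gesture phases (preparation, stroke, and retraction) rather than a binary decision about whether a segment contains a gesture, and reports improved detection accuracy over existing binary classification methods.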
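To make the contrast between the two framings concrete, the following minimal sketch shows how a per-frame phase labeling carries strictly more information than a binary one: the binary labels can be derived from the phase labels, but not the other way round. The label name `neutral` for non-gesture frames, the helper names, and the toy frame sequence are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from itertools import groupby

# Phase names follow the abstract; "neutral" (no gesture) is an assumed name.
PHASES = ("neutral", "preparation", "stroke", "retraction")

def to_binary(phase_labels):
    """Collapse per-frame phase labels into gesture (1) / no gesture (0)."""
    return [0 if label == "neutral" else 1 for label in phase_labels]

def segments(phase_labels):
    """Run-length encode labels into (label, start, end_exclusive) spans."""
    spans, start = [], 0
    for label, run in groupby(phase_labels):
        length = len(list(run))
        spans.append((label, start, start + length))
        start += length
    return spans

# Toy sequence of eight frames (hypothetical data).
frames = ["neutral", "preparation", "preparation", "stroke",
          "stroke", "stroke", "retraction", "neutral"]

binary = to_binary(frames)   # what a binary detector would be trained on
spans = segments(frames)     # what a phase-aware labeler can recover
print(binary)
print(spans)
```

The binary view reduces the whole gesture to a single run of ones, while the phase view keeps the boundaries between preparation, stroke, and retraction.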

Findings

01 The framework improves detection performance over the binary-classification baselines.

02 Transformer encoders enhance contextual understanding of gesture sequences.

03 The model detects gesture stroke phases effectively in face-to-face dialogues.

Abstract

Gestures are integral components of face-to-face communication. They unfold over time, often following predictable movement phases of preparation, stroke, and retraction. Yet, the prevalent approach to automatic gesture detection treats the problem as binary classification, classifying a segment as either containing a gesture or not, thus failing to capture its inherently sequential and contextual nature. To address this, we introduce a novel framework that reframes the task as a multi-phase sequence labeling problem rather than binary classification. Our model processes sequences of skeletal movements over time windows, uses Transformer encoders to learn contextual embeddings, and leverages Conditional Random Fields to perform sequence labeling. We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues. The results consistently…
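The sequence labeling step described in the abstract can be illustrated with Viterbi decoding, the standard way to find the best label path under a linear-chain CRF. This is a sketch under stated assumptions, not the authors' implementation: in the paper the per-frame scores would come from Transformer encoders over skeletal windows and the transition scores would be learned, whereas here both are hand-set toy values, and the label indices and the set of allowed transitions are hypothetical.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label path for a linear-chain model.

    emissions[t][j]: score of label j at time step t.
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_labels):
            # Best previous label for arriving at label j.
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
        score = new_score
        backptr.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]

# Assumed label indices: 0 neutral, 1 preparation, 2 stroke, 3 retraction.
# Assumed toy rule: staying in a phase or moving to the next one is free,
# every other jump is penalized.
allowed = {(0, 1), (1, 2), (2, 3), (3, 0)}
transitions = [[0.0 if i == j or (i, j) in allowed else -5.0
                for j in range(4)] for i in range(4)]

# Hand-set emission scores; frame 3 weakly prefers "neutral" mid-stroke.
emissions = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [1.5, 0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]

framewise = [max(range(4), key=lambda j: row[j]) for row in emissions]
decoded = viterbi(emissions, transitions)
print(framewise)  # [0, 1, 2, 0, 2]
print(decoded)    # [0, 1, 2, 2, 2]
```

Labeling each frame independently breaks the stroke with a one-frame gap, while decoding with transition scores keeps the stroke contiguous. This is the kind of sequential context that a per-segment binary classifier cannot exploit.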

Code & Models

Repositories

EsamGhaleb/Multi-Phase-Gesture-Detection
PyTorch · Official

Videos

Co-Speech Gesture Detection Through Multi-Phase Sequence Labeling · YouTube

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Softmax · Dense Connections