Sequence-to-Sequence Predictive Model: From Prosody To Communicative Gestures
Fajrian Yunus, Chloé Clavel, Catherine Pelachaud

TL;DR
This paper presents a neural network model that predicts the timing of communicative gestures based on speech acoustics, demonstrating its effectiveness and the importance of features like fundamental frequency and eyebrow movements.
Contribution
The study introduces a sequence-to-sequence neural model for gesture timing prediction from speech, including novel findings on feature relevance and cross-speaker applicability.
Findings
The model predicts certain gesture classes more accurately than others.
Fundamental frequency is a key feature for prediction.
Including eyebrow movements improves performance.
Abstract
Communicative gestures and speech acoustics are tightly linked. Our objective is to predict the timing of gestures from the acoustics; that is, we want to predict when a certain gesture occurs. We develop a model based on a recurrent neural network with an attention mechanism. The model is trained on a corpus of natural dyadic interaction in which the speech acoustics and the gesture phases and types have been annotated. The input of the model is a sequence of speech acoustic features and the output is a sequence of gesture classes. The classes used for the model output are based on a combination of gesture phases and gesture types. We use a sequence comparison technique to evaluate the model performance. We find that the model predicts certain gesture classes better than others. We also perform ablation studies which reveal that fundamental frequency is a relevant feature for gesture…
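The abstract does not say which sequence comparison technique is used for evaluation. As a rough illustration only, the sketch below assumes a Levenshtein (edit) distance between a reference and a predicted sequence of gesture classes. The class labels, the function names, and the length normalization are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences (assumed metric)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_distance(ref, hyp):
    """Edit distance divided by the longer sequence length (hypothetical choice)."""
    longest = max(len(ref), len(hyp))
    return edit_distance(ref, hyp) / longest if longest else 0.0


# Hypothetical gesture-class labels, one per time step.
reference = ["rest", "stroke_beat", "stroke_beat", "rest"]
predicted = ["rest", "stroke_beat", "rest", "rest"]

print(edit_distance(reference, predicted))
print(normalized_distance(reference, predicted))
```

A lower normalized distance means the predicted gesture sequence is closer to the annotated one. Computing the score separately per gesture class would be one way to see which classes are predicted better than others.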
