TL;DR
This paper introduces a speech-driven gesture generation framework that adds representation learning to deep-learning-based methods. It reports better motion dynamics and perceived naturalness than the baseline, and examines how smoothing the generated motion as a post-processing step affects the result.
Contribution
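The paper extends deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. It analyzes the input (speech) and output (motion) representations and the effect of post-processing through objective and subjective evaluations, and reports improved naturalness and motion quality over the baseline.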
Findings
Generated gestures capture motion dynamics better and match the motion-speed distribution more closely than the baseline.
User studies show increased perceived naturalness of gestures.
Post-processing techniques like smoothing enhance gesture quality.
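As an illustration of the smoothing step mentioned above, here is a minimal sketch of a centered moving average over a sequence of poses. The summary does not say which filter the paper uses, so the moving average, the function name `smooth_motion`, and the `window` default are all assumptions made for this example.

```python
def smooth_motion(frames, window=5):
    """Centered moving average over time for a sequence of poses.

    frames: list of T poses; each pose is a flat list of floats
            (for example x, y, z for every joint).
    window: odd number of frames to average over (hypothetical default,
            not taken from the paper).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(frames)
    smoothed = []
    for t in range(n):
        # Shrink the window at the sequence edges instead of padding.
        lo, hi = max(0, t - half), min(n, t + half + 1)
        span = frames[lo:hi]
        dims = len(frames[t])
        smoothed.append([sum(p[d] for p in span) / len(span) for d in range(dims)])
    return smoothed
```

A larger window removes more frame-to-frame jitter but also flattens fast, deliberate gestures, so the window size is a trade-off rather than a fixed choice.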
Abstract
This paper presents a novel framework for speech-driven gesture production, applicable to virtual agents to enhance human-computer interaction. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. We provide an analysis of different representations for the input (speech) and the output (motion) of the network by both objective and subjective evaluations. We also analyse the importance of smoothing of the produced motion. Our results indicated that the proposed method improved on our baseline in terms of objective measures. For example, it better captured the motion dynamics and better matched the motion-speed distribution. Moreover, we performed user studies on two different…
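The abstract mentions matching the motion-speed distribution as an objective measure. The sketch below shows one way such a distribution could be computed from a sequence of 3D joint coordinates. The function names, the bin width, and the choice to clip high speeds into the last bin are assumptions for illustration, not details taken from the paper.

```python
import math


def joint_speeds(frames, fps):
    """Speed of every joint between consecutive frames.

    frames: list of T poses; each pose is a list of (x, y, z) tuples.
    fps: frame rate of the motion data, used to convert
         per-frame displacement into units per second.
    """
    speeds = []
    for prev, curr in zip(frames, frames[1:]):
        for p, c in zip(prev, curr):
            speeds.append(math.dist(p, c) * fps)
    return speeds


def speed_histogram(speeds, bin_width, n_bins):
    """Normalized histogram; speeds beyond the last bin are clipped into it."""
    counts = [0] * n_bins
    for s in speeds:
        counts[min(int(s // bin_width), n_bins - 1)] += 1
    total = len(speeds)
    return [c / total for c in counts] if total else [0.0] * n_bins
```

Computing the same histogram for generated and ground-truth motion, then comparing the two bin by bin, gives a rough picture of whether the generated gestures move at realistic speeds.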
