TL;DR
This paper introduces a real-time, data-driven system that uses GANs to generate speaker-specific gestures from speech, animating a virtual avatar with a delay below three seconds.
Contribution
It presents a novel GAN-based approach, trained on online video data, that synthesizes speaker-specific gestures directly from speech audio in real time.
Findings
Achieves a delay of less than three seconds between audio input and gesture animation
Utilizes large-scale online video data for training
Generates natural, speaker-specific gestures in real time
Abstract
We propose a real-time system for synthesizing gestures directly from speech. Our data-driven approach is based on Generative Adversarial Neural Networks to model the speech-gesture relationship. We utilize the large amount of speaker video data available online to train our 3D gesture model. Our model generates speaker-specific gestures by taking consecutive audio input chunks of two seconds in length. We animate the predicted gestures on a virtual avatar. We achieve a delay below three seconds between the time of audio input and gesture animation. Code and videos are available at https://github.com/mrebol/Gestures-From-Speech
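The abstract states that the model consumes consecutive audio chunks of two seconds. The following is a minimal sketch of that chunking step only, not the authors' implementation: the function name `iter_audio_chunks`, the 16 kHz sample rate in the usage note, and the choice to drop a trailing partial chunk are all assumptions made for illustration.

```python
from typing import Iterator, Sequence

CHUNK_SECONDS = 2.0  # chunk length stated in the abstract


def iter_audio_chunks(
    samples: Sequence[float],
    sample_rate: int,
    chunk_seconds: float = CHUNK_SECONDS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping chunks of `chunk_seconds` of audio.

    Hypothetical helper: a trailing remainder shorter than one full chunk
    is dropped (an assumption; the source does not say how it is handled).
    """
    chunk_len = int(round(sample_rate * chunk_seconds))
    if chunk_len <= 0:
        raise ValueError("sample_rate * chunk_seconds must be positive")
    # Step through the signal one full chunk at a time.
    for start in range(0, len(samples) - chunk_len + 1, chunk_len):
        yield samples[start:start + chunk_len]


# Usage: 5 seconds of silence at an assumed 16 kHz sample rate.
audio = [0.0] * (16000 * 5)
chunks = list(iter_audio_chunks(audio, sample_rate=16000))
```

In this sketch each chunk would be passed to the gesture generator in turn. If a chunk has to be fully buffered before inference starts (an assumption, not stated in the source), the two-second buffer itself accounts for most of the reported sub-three-second delay.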
