TL;DR
This paper introduces Mix-StAGE, a multi-speaker gesture style transfer model that learns unique speaker styles and generates natural co-speech gestures, advancing the ability of virtual agents to mimic diverse gesturing styles.
Contribution
The paper presents Mix-StAGE, a novel mixture of generative models that disentangles style and content for multi-speaker co-speech gesture generation and style transfer.
Findings
Mix-StAGE outperforms previous methods in gesture generation quality.
The model effectively preserves and transfers individual speaker styles.
A new dataset, PATS, supports research on multi-speaker gesture style.
Abstract
How can we teach robots or virtual assistants to gesture naturally? Can we go further and adapt the gesturing style to follow a specific speaker? Gestures that are naturally timed with corresponding speech during human communication are called co-speech gestures. A key challenge, called gesture style transfer, is to learn a model that generates these gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'. A secondary goal is to simultaneously learn to generate co-speech gestures for multiple speakers while remembering what is unique about each speaker. We call this challenge style preservation. In this paper, we propose a new model, named Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures in an end-to-end manner. A novelty of Mix-StAGE is to learn a mixture of generative models which…
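To make the difference between style preservation and style transfer concrete, here is a minimal toy sketch in Python. It is not the Mix-StAGE architecture. It only assumes, as the abstract states, that a single model is shared across speakers and that each speaker has their own style embedding. The class name `ToyStyleConditionedGenerator`, the dimensions, the random (untrained) weights, and the linear mapping are all hypothetical choices made for illustration.

```python
import random


class ToyStyleConditionedGenerator:
    """Toy illustration only: one shared mapping plus a style vector per speaker.

    Hypothetical stand-in, not the authors' model. All weights are random
    and untrained; a real system would learn them end to end.
    """

    def __init__(self, speakers, audio_dim=4, style_dim=3, pose_dim=2, seed=0):
        rng = random.Random(seed)
        self.audio_dim = audio_dim
        # One style embedding per speaker (assumed to be a small vector).
        self.style = {
            s: [rng.uniform(-1.0, 1.0) for _ in range(style_dim)]
            for s in speakers
        }
        # Weights shared by all speakers, mapping [audio ; style] to a pose.
        in_dim = audio_dim + style_dim
        self.weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(pose_dim)
        ]

    def generate(self, audio_frames, style_speaker):
        """Return one pose vector per audio frame, in the chosen speaker's style."""
        style = self.style[style_speaker]
        poses = []
        for frame in audio_frames:
            if len(frame) != self.audio_dim:
                raise ValueError("unexpected audio feature size")
            x = list(frame) + style  # concatenate content and style
            poses.append([sum(w * v for w, v in zip(row, x)) for row in self.weights])
        return poses


model = ToyStyleConditionedGenerator(["A", "B"])

# Made-up audio features for two frames spoken by speaker A.
audio_from_a = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.5, 0.2]]

# Style preservation: A's speech rendered with A's own style embedding.
preserved = model.generate(audio_from_a, style_speaker="A")

# Style transfer: the same speech rendered with B's style embedding.
transferred = model.generate(audio_from_a, style_speaker="B")
```

In this sketch the only thing that changes between the two calls is which style embedding is looked up, so any difference between `preserved` and `transferred` comes from the style input rather than from the speech content.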
