TL;DR
PersonaGest is a two-stage framework that enhances co-speech gesture generation by explicitly disentangling semantic content and style, leading to more coherent and personalized gestures.
Contribution
PersonaGest introduces a semantic-guided hierarchical motion representation (a semantic-guided RVQ-VAE with a Semantic-Aware Motion Codebook, SMoC) and a contrastive learning approach that together improve semantic coherence and personalization.
Findings
Achieves state-of-the-art results on objective metrics.
Demonstrates strong style consistency in user studies.
Effectively disentangles content and style for personalized gestures.
Abstract
Co-speech gesture generation aims to synthesize realistic body movements that are semantically coherent with speech and faithful to a user-specified gestural style. Existing VQ-VAE-based co-speech gesture generation methods improve generation quality but neither encode semantic structure into the motion representation nor explicitly disentangle content from style, limiting both semantic coherence and personalization fidelity. We present PersonaGest, a two-stage framework addressing both limitations. In the first stage, a semantic-guided RVQ-VAE disentangles motion content and gestural style within the residual quantization structure, where a Semantic-Aware Motion Codebook (SMoC) organizes the content codebook by gesture semantics and contrastive learning further enforces content-style separation. In the second stage, a Masked Generative Transformer generates content tokens via a…
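To make the "residual quantization structure" mentioned in the abstract concrete, here is a minimal sketch of generic residual vector quantization in plain Python. It is not the paper's implementation: the function names, the toy codebooks, and the choice to label the first layer "content" and the second "style" are all assumptions made for illustration. The abstract only states that content and style are disentangled within the residual structure, not how layers are assigned.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def residual_quantize(vec, codebooks):
    """Quantize vec with a stack of codebooks.

    Each layer picks the nearest code to the residual left over by the
    previous layers, so later layers refine what earlier layers missed.
    Returns the chosen index per layer and the summed reconstruction.
    """
    residual = list(vec)
    recon = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        code = codebook[idx]
        indices.append(idx)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, recon


# Hypothetical toy codebooks: a coarse first layer (labelled "content" here
# purely as an assumption) and a finer second layer (labelled "style").
content_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
style_codebook = [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]]

indices, recon = residual_quantize([1.25, 1.0], [content_codebook, style_codebook])
print(indices, recon)
```

In this toy case the first layer captures the coarse vector and the second layer encodes only what remains, which is the property that lets different layers of a residual quantizer carry different kinds of information.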
