TL;DR
PersonaGesture introduces a diffusion-based method for personalized co-speech gesture synthesis for unseen speakers using only a single reference clip, improving style retention without per-speaker training.
Contribution
The paper presents a novel diffusion pipeline with Adaptive Style Infusion and Implicit Distribution Rectification for effective unseen speaker gesture personalization from minimal reference data.
Findings
Outperforms existing methods on unseen-speaker personalization metrics.
Effectively separates speaker style from utterance-specific gestures.
Achieves high human preference scores in qualitative evaluations.
Abstract
We propose PersonaGesture, a diffusion-based pipeline for single-reference co-speech gesture personalization of unseen speakers. Given target speech and one motion clip from a new speaker, the model must synthesize gestures that follow the new utterance while retaining speaker-specific pose choices, without per-speaker optimization. This setting is useful for avatars and virtual agents, but it is hard because the reference mixes stable speaker habits with utterance-specific trajectories. PersonaGesture consists of two key components, Adaptive Style Infusion (ASI) and Implicit Distribution Rectification (IDR), to separate temporal identity evidence from residual statistic correction. A Style Perceiver first encodes the variable-length reference into compact speaker-memory tokens. ASI injects these tokens into denoising through zero-initialized residual cross-attention, enabling style…
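The zero-initialized residual cross-attention named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The single-head layout, the choice to zero the output projection (rather than some other weight), and every name here (`ZeroInitCrossAttention`, `w_out`, `style_tokens`) are assumptions made for illustration. The point it shows: when the output projection starts at zero, the residual branch contributes nothing, so the layer initially returns the denoiser's hidden states unchanged, and style influence can only enter as that projection is trained.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale):
    """Gaussian-initialized weight matrix."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ZeroInitCrossAttention:
    """Hypothetical single-head residual cross-attention.

    Assumption: the output projection w_out is the zero-initialized weight.
    """

    def __init__(self, dim, rng):
        scale = 1.0 / math.sqrt(dim)
        self.dim = dim
        self.w_q = random_matrix(dim, dim, rng, scale)
        self.w_k = random_matrix(dim, dim, rng, scale)
        self.w_v = random_matrix(dim, dim, rng, scale)
        # Zero-initialized, so the residual branch is a no-op before training.
        self.w_out = [[0.0] * dim for _ in range(dim)]

    def __call__(self, hidden, style_tokens):
        q = matmul(hidden, self.w_q)          # (T, dim) queries from motion states
        k = matmul(style_tokens, self.w_k)    # (M, dim) keys from speaker-memory tokens
        v = matmul(style_tokens, self.w_v)    # (M, dim) values from speaker-memory tokens
        k_t = [list(col) for col in zip(*k)]  # (dim, M)
        scores = matmul(q, k_t)               # (T, M)
        attn = [softmax([s / math.sqrt(self.dim) for s in row]) for row in scores]
        update = matmul(matmul(attn, v), self.w_out)  # (T, dim)
        # Residual connection: hidden states plus the (initially zero) style update.
        return [[h + u for h, u in zip(h_row, u_row)]
                for h_row, u_row in zip(hidden, update)]


rng = random.Random(0)
layer = ZeroInitCrossAttention(dim=8, rng=rng)
hidden = random_matrix(16, 8, rng, 1.0)       # 16 motion frames (illustrative size)
style_tokens = random_matrix(4, 8, rng, 1.0)  # 4 speaker-memory tokens (illustrative size)
out = layer(hidden, style_tokens)
```

At initialization `out` equals `hidden` element for element. Replacing `layer.w_out` with any nonzero matrix makes the style tokens start to shift the output, which is the behavior a zero-initialized residual branch is meant to allow gradually during training.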
