One Shot, One Talk: Whole-body Talking Avatar from a Single Image
Jun Xiang, Yudong Guo, Leipeng Hu, Boyang Guo, Yancheng Yuan, Juyong Zhang

TL;DR
This paper introduces a pipeline for building a realistic, animatable whole-body talking avatar from a single image. It addresses dynamic modeling and generalization to novel gestures and expressions by combining pose-guided image-to-video diffusion models with a 3DGS-mesh hybrid avatar representation.
Contribution
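The method uses pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels, then fits a tightly coupled 3DGS-mesh (3D Gaussian Splatting plus mesh) hybrid avatar to them, with regularizations that mitigate the inconsistencies in those labels. The result is an avatar built from one image with control over gestures and expressions.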
Findings
Enables creation of photorealistic avatars from one image
Achieves precise animation of gestures and expressions
Demonstrates robustness across diverse subjects
Abstract
Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects…
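As a rough illustration of what coupling Gaussians to a mesh can mean, the sketch below attaches one Gaussian centre to a triangle through fixed barycentric weights, so the centre follows the triangle when the mesh is posed. This is a common binding scheme assumed here for illustration only; the abstract does not specify the paper's actual coupling, and the names `bind_point`, `rest_triangle`, and `posed_triangle` are hypothetical.

```python
# Illustrative sketch only: barycentric binding of a Gaussian centre to a
# mesh triangle. This is an assumed scheme, not the paper's stated method.

def bind_point(bary, triangle):
    """Return the 3D point given barycentric weights and three vertices."""
    return tuple(
        sum(w * v[axis] for w, v in zip(bary, triangle))
        for axis in range(3)
    )

# Hypothetical rest-pose triangle, with a Gaussian centre at its centroid.
rest_triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
bary = (1 / 3, 1 / 3, 1 / 3)

# Hypothetical posed mesh: the same triangle translated by +2 along z.
posed_triangle = [(x, y, z + 2.0) for x, y, z in rest_triangle]

rest_center = bind_point(bary, rest_triangle)
posed_center = bind_point(bary, posed_triangle)

# The weights stay fixed, so the centre moves with the triangle.
print(rest_center, posed_center)
```

Because the weights are stored once and reused, any pose applied to the mesh vertices carries the attached Gaussians along with it, which is one way a mesh can drive a Gaussian-based appearance model.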
Taxonomy
Topics: Virtual Reality Applications and Impacts · Augmented Reality Applications
Methods: Diffusion
