GAIA: Zero-shot Talking Avatar Generation
Tianyu He, Junliang Guo, Runyi Yu, Yuchi Wang, Jialiang Zhu, Kaikai An, Leyi Li, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian

TL;DR
GAIA is a scalable, domain-agnostic framework for zero-shot talking avatar generation that produces more natural, diverse, and high-quality talking videos from speech and a single portrait image, surpassing previous methods.
Contribution
This work introduces GAIA, a novel domain-prior-free approach for talking avatar generation, leveraging large-scale training and disentangled representations for improved performance.
Findings
Outperforms previous models in naturalness, diversity, and lip-sync quality.
Larger models yield better results, demonstrating scalability.
Enables controllable and text-instructed avatar generation.
Abstract
Zero-shot talking avatar generation aims at synthesizing natural talking videos from speech and a single portrait image. Previous methods have relied on domain-specific heuristics such as warping-based motion representation and 3D Morphable Models, which limit the naturalness and diversity of the generated avatars. In this work, we introduce GAIA (Generative AI for Avatar), which eliminates the domain priors in talking avatar generation. In light of the observation that the speech only drives the motion of the avatar while the appearance of the avatar and the background typically remain the same throughout the entire video, we divide our approach into two stages: 1) disentangling each frame into motion and appearance representations; 2) generating motion sequences conditioned on the speech and reference portrait image. We collect a large-scale high-quality talking avatar dataset and…
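The two-stage split described in the abstract can be illustrated with a toy data-flow sketch. This is not the authors' implementation: the learned networks are replaced by trivial stand-ins (list slicing and element-wise addition), and every name here (`Latents`, `encode_frame`, `decode_frame`, `generate_motion`, `animate`, `motion_dim`) is hypothetical. It only shows which inputs feed which stage, and that appearance is taken once from the reference portrait while motion changes per frame.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class Latents:
    motion: Vector      # varies per frame (e.g. pose, lips, expression)
    appearance: Vector  # assumed constant across the video (identity, background)


def encode_frame(frame: Vector, motion_dim: int) -> Latents:
    """Stage 1 stand-in: split a frame vector into motion and appearance parts.

    A real system would use a learned encoder; slicing is a placeholder.
    """
    return Latents(motion=frame[:motion_dim], appearance=frame[motion_dim:])


def decode_frame(motion: Vector, appearance: Vector) -> Vector:
    """Stage 1 stand-in: rebuild a frame from its two latents (inverse of the split)."""
    return motion + appearance


def generate_motion(speech: List[Vector], reference: Latents) -> List[Vector]:
    """Stage 2 stand-in: one motion latent per speech step.

    Conditioning on the reference portrait is mimicked by offsetting the
    reference motion with each speech feature; a real generator is learned.
    """
    return [[r + s for r, s in zip(reference.motion, step)] for step in speech]


def animate(portrait: Vector, speech: List[Vector], motion_dim: int) -> List[Vector]:
    """Full toy pipeline: a single portrait plus speech features gives video frames."""
    reference = encode_frame(portrait, motion_dim)
    motions = generate_motion(speech, reference)
    # Appearance comes from the portrait once and is reused for every frame.
    return [decode_frame(m, reference.appearance) for m in motions]


if __name__ == "__main__":
    portrait = [0.0, 1.0, 5.0, 6.0, 7.0]   # first 2 values play the role of motion
    speech = [[1.0, 2.0], [3.0, 4.0]]      # two speech steps give two frames
    for frame in animate(portrait, speech, motion_dim=2):
        print(frame)
```

In this sketch only the first `motion_dim` entries differ between output frames; the remaining entries are copied from the portrait, mirroring the abstract's observation that speech drives motion while appearance and background stay fixed.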
Peer Reviews
Decision: ICLR 2024 poster
- Impressive quality of the results in terms of both lipsync and visual quality
- The model's design is straightforward yet evidently effective
- The paper is well-written and the evaluation is pretty extensive
- Missing evaluation of disentanglement between appearance and pose latent codes, i.e., cross-reenactment with the motion codes extracted from the image of a different identity.
- Missing discussion of the related works, such as [1, 2], that explored the concept of pose-identity disentanglement for talking head synthesis before this work.
- As far as I can tell, the proposed method and the baselines were trained on different datasets. The resulting comparison evaluates the proposed framework _an
The manuscript proposes a new dataset. The writing is supported by equations and well-drawn figures that make the explanation clear. Although the experiments with existing models are not enough (see weaknesses), the ablation study is rich and increases the overall quality.
The gaps and limitations of 3DMM-based models are identified and addressed well by proposing an end-to-end trainable model. I am not sure it is novel enough, as other end-to-end trainable talking face synthesis models are not discussed enough. The experiments are limited; in particular, no comparison with end-to-end trainable models is provided. I suggest enriching the benchmarking with other existing models such as PC-AVS and PD-FGC, as they are also end-to-end trainable models. Although the writing qua
1) The method is conceptually simple and sensible. 2) It is shown to scale well in terms of model size and, as a self-supervised method, can utilize readily available training data at scale. 3) The method requires very few pretrained components. 4) The evaluation includes a user study, which is always good for addressing output quality. 5) The method is highly flexible and allows a high degree of control from pose, facial attributes, and text.
1) A comparison with https://arxiv.org/pdf/2012.08261.pdf for the video-driven setting is critically missing, as it is a recently proposed SOTA method. In their paper, they show improvements over face-vid2vid and FOMM, which are used as baselines here, and they provide a pretrained checkpoint.
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
