Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis
Zhenhui Ye, Tianyun Zhong, Yi Ren, Jiaqi Yang, Weichuang Li, Jiawei Huang, Ziyue Jiang, Jinzheng He, Rongjie Huang, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao

TL;DR
Real3D-Portrait introduces a comprehensive framework for one-shot 3D talking portrait synthesis that enhances reconstruction accuracy, stabilizes animation, and adds natural torso and background rendering for more realistic videos.
Contribution
It combines large image-to-plane models, motion adapters, super-resolution techniques, and audio-driven models to improve 3D reconstruction, animation stability, and realism in talking portrait generation.
Findings
Outperforms previous methods in realism and generalization to unseen identities.
Successfully generates natural torso movements and switchable backgrounds.
Achieves stable, accurate, and realistic talking portrait videos.
Abstract
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video. Existing methods fail to simultaneously achieve accurate 3D avatar reconstruction and stable talking face animation. Moreover, while existing works mainly focus on synthesizing the head, it is also vital to generate natural torso and background segments to obtain a realistic talking portrait video. To address these limitations, we present Real3D-Portrait, a framework that (1) improves the one-shot 3D reconstruction power with a large image-to-plane model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter; (3) synthesizes realistic video with natural torso movement and switchable…
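The three numbered stages in the abstract can be read as a simple data flow: reconstruct once from the source image, then adapt and render once per driving frame. The sketch below is only a toy illustration of that flow, not the authors' implementation. Every function name, the list-of-floats stand-in for the 3D representation, and the choice to add the adapter output to the reconstructed features are assumptions made for illustration.

```python
from typing import List

Plane = List[float]  # toy stand-in for a 3D feature representation (assumption)


def image_to_plane(source_image: List[float]) -> Plane:
    """Stage (1), hypothetical stub: map one source image to 3D features."""
    return [0.5 * pixel for pixel in source_image]


def motion_adapter(motion: List[float]) -> Plane:
    """Stage (2), hypothetical stub: map a motion condition to a feature offset."""
    return [0.1 * m for m in motion]


def render_portrait(plane: Plane, torso: float, background: float) -> List[float]:
    """Stage (3), hypothetical stub: combine head features with torso and background."""
    return [head + torso + background for head in plane]


def synthesize(source_image: List[float],
               motions: List[List[float]],
               torso: float = 0.0,
               background: float = 0.0) -> List[List[float]]:
    """Run the toy pipeline: one reconstruction, then one render per motion frame."""
    canonical = image_to_plane(source_image)  # computed once per identity
    frames = []
    for motion in motions:
        offset = motion_adapter(motion)
        # Assumption: the motion offset is added to the reconstructed features.
        animated = [c + o for c, o in zip(canonical, offset)]
        frames.append(render_portrait(animated, torso, background))
    return frames
```

In this toy version, changing the `background` argument alters every rendered frame without repeating the reconstruction step, which is one way to picture what a switchable background could mean in practice.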
Peer Reviews
Decision: ICLR 2024 spotlight
The paper addresses a fairly novel problem of single-shot joint 3D reconstruction and animation of facial images. Unlike prior work, it seeks to inject the animation information directly into the 3D representation, versus the dominant approach of animating in 2D first and then lifting into 3D. As the authors correctly point out, animation in 3D versus 2D results in more correct handling of large head poses and fewer warping artifacts. So this is an important problem to address towards enab
1. The method by design requires the canonicalization, i.e., removal of the source image's facial expression, for the driving expression to be successfully applied to it. This is because the PNCC code is derived purely from the target driving video/audio's expression and hence cannot contain the information to erase/neutralize the source image's facial expression. I think the proposed method achieves this canonicalization during the fine-tuning phase with the Celeb-V-HQ video dataset. However, i
1) The paper is well written. 2) The qualitative results on reanimation are good, even when driven by audio. 3) The quantitative results show that the proposed method out-performs prior art. 4) The background is rendered well and merges seamlessly with the foreground.
1) Given that the overall architecture is very similar to HiDe-NeRF, it is unclear where the improvement of the proposed method is coming from. Is it because a pretrained tri-plane is a better representation than the multi-resolution tri-plane features of HiDe-NeRF? It would be great if the authors could clarify this. 2) The addition of background and torso modelling, while important, is relatively incremental.
+ The proposed method is interesting and its pipeline has sufficient novelty, especially in terms of the combination of the large-scale image-to-plane backbone, the motion adapter and the Head-Torso-Background Super-Resolution model, which results in particularly realistic results. + The paper includes an in-depth experimental evaluation that provides sound evidence about the promising results of the proposed method. In more detail, the proposed method is compared with several recent SOTA meth
- The paper has omitted citing some important related methods of the field: J. S. Chung, A. Jamaludin, and A. Zisserman, “You said that?” in BMVC, 2017. Ye, Z., Xia, M., Yi, R., Zhang, J., Lai, Y.-K., Huang, X., et al. (2022). Audio-driven talking face video generation with dynamic convolution kernels. IEEE Transactions on Multimedia. In addition, the method is based on the projected normalized coordinate code (PNCC) representation but it has not cited one of the most important works of the
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing
