Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

Zhenhui Ye; Tianyun Zhong; Yi Ren; Jiaqi Yang; Weichuang Li; Jiawei Huang; Ziyue Jiang; Jinzheng He; Rongjie Huang; Jinglin Liu; Chen Zhang; Xiang Yin; Zejun Ma; Zhou Zhao

arXiv:2401.08503 · cs.CV · March 26, 2024 · 1 citation

Open Access · 1 Repo · 4 Models · 3 Reviews

TL;DR

Real3D-Portrait introduces a comprehensive framework for one-shot 3D talking portrait synthesis that enhances reconstruction accuracy, stabilizes animation, and adds natural torso and background rendering for more realistic videos.

Contribution

It combines large image-to-plane models, motion adapters, super-resolution techniques, and audio-driven models to improve 3D reconstruction, animation stability, and realism in talking portrait generation.

Findings

01. Outperforms previous methods in realism and generalization to unseen identities.

02. Successfully generates natural torso movements and switchable backgrounds.

03. Achieves stable, accurate, and realistic talking portrait videos.

Abstract

One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video. The existing methods fail to simultaneously achieve the goals of accurate 3D avatar reconstruction and stable talking face animation. Besides, while the existing works mainly focus on synthesizing the head part, it is also vital to generate natural torso and background segments to obtain a realistic talking portrait video. To address these limitations, we present Real3D-Portrait, a framework that (1) improves the one-shot 3D reconstruction power with a large image-to-plane model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter; (3) synthesizes realistic video with natural torso movement and switchable…
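As a rough illustration of points (1) and (2) in the abstract, the sketch below shows how a tri-plane representation could be queried after a motion adapter has modified it. It is not the authors' implementation. It assumes that the image-to-plane model outputs three axis-aligned feature planes, that the motion adapter's output is simply added to them as a residual, and that per-point features are the sum of the three bilinear samples. The function names (sample_plane, query_triplane, animate) and all shapes are hypothetical, and random arrays stand in for network outputs.

```python
import numpy as np


def sample_plane(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at coordinates in [-1, 1].

    u indexes the width axis and v the height axis; both have shape (N,).
    Returns an (N, C) array of interpolated features.
    """
    _, height, width = plane.shape
    # Map [-1, 1] to continuous pixel coordinates.
    x = (u + 1.0) * 0.5 * (width - 1)
    y = (v + 1.0) * 0.5 * (height - 1)
    # Index of the top-left neighbour, clipped so that +1 stays in bounds.
    x0 = np.clip(np.floor(x).astype(int), 0, width - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, height - 2)
    wx = np.clip(x - x0, 0.0, 1.0)
    wy = np.clip(y - y0, 0.0, 1.0)
    top = plane[:, y0, x0] * (1 - wx) + plane[:, y0, x0 + 1] * wx
    bottom = plane[:, y0 + 1, x0] * (1 - wx) + plane[:, y0 + 1, x0 + 1] * wx
    return (top * (1 - wy) + bottom * wy).T


def query_triplane(planes, points):
    """Sum features from the XY, XZ and YZ planes for (N, 3) points.

    planes has shape (3, C, H, W); points lie in [-1, 1]^3.
    Summation as the aggregation rule is an assumption of this sketch.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return (
        sample_plane(planes[0], x, y)
        + sample_plane(planes[1], x, z)
        + sample_plane(planes[2], y, z)
    )


def animate(canonical_planes, motion_residual):
    """Hypothetical motion conditioning: add a residual to the planes."""
    return canonical_planes + motion_residual


# Random stand-ins for the outputs of the two networks.
rng = np.random.default_rng(0)
canonical = rng.normal(size=(3, 8, 16, 16))
residual = np.zeros_like(canonical)  # zero residual leaves the avatar unchanged
points = rng.uniform(-1.0, 1.0, size=(5, 3))

features = query_triplane(animate(canonical, residual), points)
print(features.shape)
```

In a full pipeline, the per-point features would be decoded into density and colour and volume-rendered into a low-resolution image before super-resolution; those stages are omitted here.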

Peer Reviews

Decision: ICLR 2024 spotlight

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 5

Strengths

The paper addresses a fairly novel problem of single-shot joint 3D reconstruction and animation of facial images. Unlike prior work, it seeks to inject the animation information directly into the 3D representation, versus the dominant approach of animating in 2D first and then lifting into 3D. As the authors correctly point out, animating in 3D rather than 2D results in more correct handling of large head poses and fewer warping artifacts. So this is an important problem to address towards enab

Weaknesses

1. The method by design requires the canonicalization, i.e., removal of the source image's facial expression, for the driving expression to be successfully applied to it. This is because the PNCC code is derived purely from the target driving video/audio's expression and hence cannot contain the information to erase/neutralize the source image's facial expression. I think the proposed method achieves this canonicalization during the fine-tuning phase with the Celeb-V-HQ video dataset. However, i

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

1) The paper is well written. 2) The qualitative results on reanimation are good, even when driven by audio. 3) The quantitative results show that the proposed method out-performs prior art. 4) The background is rendered well and merges seamlessly with the foreground.

Weaknesses

1) Given that the overall architecture is very similar to HiDe-NeRF, it is unclear where the improvement of the proposed method is coming from. Is it because a pretrained tri-plane is a better representation than the multi-resolution tri-plane features of HiDe-NeRF? It would be great if the authors could clarify this. 2) The addition of background and torso modelling, while important, is relatively incremental.

Reviewer 03 · Rating 10 (strong accept, should be highlighted at the conference) · Confidence 4

Strengths

+ The proposed method is interesting and its pipeline has sufficient novelty, especially in terms of the combination of the large-scale image-to-plane backbone, the motion adapter and the Head-Torso-Background Super-Resolution model, which results in particularly realistic results. + The paper includes an in-depth experimental evaluation that provides sound evidence about the promising results of the proposed method. In more detail, the proposed method is compared with several recent SOTA meth

Weaknesses

- The paper has omitted citing some important related methods of the field: J. S. Chung, A. Jamaludin, and A. Zisserman, “You said that?” in BMVC, 2017. Ye, Z., Xia, M., Yi, R., Zhang, J., Lai, Y.-K., Huang, X., et al. (2022). Audio-driven talking face video generation with dynamic convolution kernels. IEEE Transactions on Multimedia. In addition, the method is based on the projected normalized coordinate code (PNCC) representation but it has not cited one of the most important works of the

Code & Models

Repositories

yerfor/Real3DPortrait
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing

Methods: Focus