Stable Video Portraits
Mirela Ostrek, Justus Thies

TL;DR
SVP is a hybrid 2D/3D generative method that produces photorealistic, controllable talking-face videos by fine-tuning a pre-trained text-to-image diffusion model per person and conditioning it on 3D Morphable Model (3DMM) sequences, yielding an editable, person-specific avatar.
Contribution
Introduces SVP, a novel hybrid 2D/3D approach combining stable diffusion with 3D face models for realistic, controllable video portrait generation without fine-tuning at test time.
Findings
Outperforms state-of-the-art monocular head avatar methods.
Produces temporally smooth, controllable talking face videos.
Enables editing and morphing of facial appearance based on text descriptions.
Abstract
Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present SVP, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar…
Taxonomy
Topics: Italian Fascism and Post-war Society
Methods: Diffusion
