Portrait Video Editing Empowered by Multimodal Generative Priors

Xuan Gao; Haiyao Xiao; Chenglai Zhong; Shimin Hu; Yudong Guo; Juyong Zhang

arXiv:2409.13591·cs.CV·September 23, 2024

Open Access

TL;DR

PortraitGen is a novel portrait video editing method that ensures 3D and temporal consistency, high-quality stylization, and real-time rendering by leveraging multimodal generative priors and a unified 3D Gaussian field.

Contribution

The paper introduces a unified 3D Gaussian field and Neural Gaussian Texture for fast, consistent, and expressive portrait video editing with multimodal prompts, addressing limitations of previous methods.

Findings

01. Achieves rendering speeds of over 100 FPS.

02. Demonstrates superior temporal consistency and stylization quality.

03. Supports diverse editing applications, including text-driven editing and relighting.

Abstract

We introduce PortraitGen, a powerful portrait video editing method that achieves consistent and expressive stylization with multimodal prompts. Traditional portrait video editing methods often struggle with 3D and temporal consistency, and typically fall short in rendering quality and efficiency. To address these issues, we lift the portrait video frames to a unified dynamic 3D Gaussian field, which ensures structural and temporal coherence across frames. Furthermore, we design a novel Neural Gaussian Texture mechanism that not only enables sophisticated style editing but also achieves rendering speeds of over 100 FPS. Our approach incorporates multimodal inputs through knowledge distilled from large-scale 2D generative models. Our system also incorporates expression similarity guidance and a face-aware portrait editing module, effectively mitigating degradation issues associated with iterative…

Taxonomy

Topics: Subtitles and Audiovisual Media · Digital Storytelling and Education

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings