Portrait Video Editing Empowered by Multimodal Generative Priors
Xuan Gao, Haiyao Xiao, Chenglai Zhong, Shimin Hu, Yudong Guo, Juyong Zhang

TL;DR
PortraitGen is a novel portrait video editing method that ensures 3D and temporal consistency, high-quality stylization, and real-time rendering by leveraging multimodal generative priors and a unified 3D Gaussian field.
Contribution
The paper introduces a unified 3D Gaussian field and Neural Gaussian Texture for fast, consistent, and expressive portrait video editing with multimodal prompts, addressing limitations of previous methods.
Findings
Achieves rendering speeds above 100 FPS.
Demonstrates superior temporal consistency and stylization quality.
Supports diverse editing applications, including text-driven editing and relighting.
Abstract
We introduce PortraitGen, a powerful portrait video editing method that achieves consistent and expressive stylization with multimodal prompts. Traditional portrait video editing methods often struggle with 3D and temporal consistency, and typically lack in rendering quality and efficiency. To address these issues, we lift the portrait video frames to a unified dynamic 3D Gaussian field, which ensures structural and temporal coherence across frames. Furthermore, we design a novel Neural Gaussian Texture mechanism that not only enables sophisticated style editing but also achieves rendering speed over 100FPS. Our approach incorporates multimodal inputs through knowledge distilled from large-scale 2D generative models. Our system also incorporates expression similarity guidance and a face-aware portrait editing module, effectively mitigating degradation issues associated with iterative…
Taxonomy
Topics: Subtitles and Audiovisual Media · Digital Storytelling and Education
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
