FLAP: Fully-controllable Audio-driven Portrait Video Generation through 3D head conditioned diffusion model

Lingzhou Mu; Baiji Liu; Ruonan Zhang; Guiming Mo; Jiawei Jin; Kai Zhang; Haozhi Huang

arXiv:2502.19455 · cs.GR · April 24, 2025

TL;DR

FLAP is a diffusion-based method for realistic, fully controllable audio-driven portrait video generation. It offers independent control over head pose and facial expressions, which makes it applicable to real-world scenarios such as filmmaking and live streaming for e-commerce.

Contribution

Introduces FLAP, which integrates explicit 3D intermediate parameters (head poses and facial expressions) into a diffusion model for end-to-end, controllable portrait video synthesis from audio.

Findings

1. Outperforms recent models in naturalness and controllability

2. Allows independent manipulation of head pose and facial expressions

3. Demonstrates flexibility with existing 3D head generation methods

Abstract

Diffusion-based video generation techniques have significantly improved zero-shot talking-head avatar generation, enhancing the naturalness of both head motion and facial expressions. However, existing methods suffer from poor controllability, making them less applicable to real-world scenarios such as filmmaking and live streaming for e-commerce. To address this limitation, we propose FLAP, a novel approach that integrates explicit 3D intermediate parameters (head poses and facial expressions) into the diffusion model for end-to-end generation of realistic portrait videos. The proposed architecture allows the model to generate vivid portrait videos from audio while simultaneously incorporating additional control signals, such as head rotation angles and eye-blinking frequency. Furthermore, the decoupling of head pose and facial expression allows for independent control of each,…
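
To make the idea of decoupled control concrete, here is a minimal sketch of how head pose and facial expression could be edited independently once they are held as separate 3D parameters. It is not the paper's implementation: the `HeadFrame` type, the `(yaw, pitch, roll)` pose layout, the `override_pose` and `force_blinks` helpers, and the notion of a single eye-closure coefficient are all assumptions made for illustration, and the values are made up.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class HeadFrame:
    # Hypothetical per-frame 3D head parameters; the paper's real format may differ.
    pose: Tuple[float, float, float]  # assumed (yaw, pitch, roll) in degrees
    expression: Tuple[float, ...]     # assumed expression coefficients


def override_pose(
    frames: List[HeadFrame],
    yaw: Optional[float] = None,
    pitch: Optional[float] = None,
    roll: Optional[float] = None,
) -> List[HeadFrame]:
    """Replace the given rotation angles on every frame; expression is untouched."""
    result = []
    for frame in frames:
        old_yaw, old_pitch, old_roll = frame.pose
        new_pose = (
            old_yaw if yaw is None else yaw,
            old_pitch if pitch is None else pitch,
            old_roll if roll is None else roll,
        )
        result.append(replace(frame, pose=new_pose))
    return result


def force_blinks(
    frames: List[HeadFrame],
    blink_index: int,
    period: int,
    closed_value: float = 1.0,
) -> List[HeadFrame]:
    """Close the eyes on every `period`-th frame; pose is untouched.

    `blink_index` is the (assumed) position of an eye-closure coefficient.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    result = []
    for i, frame in enumerate(frames):
        if i % period == 0:
            coeffs = list(frame.expression)
            coeffs[blink_index] = closed_value
            frame = replace(frame, expression=tuple(coeffs))
        result.append(frame)
    return result


# Stand-in for parameters predicted from audio (made-up values).
predicted = [HeadFrame(pose=(float(i), 0.0, 0.0), expression=(0.0, 0.5)) for i in range(4)]
edited = force_blinks(override_pose(predicted, yaw=15.0), blink_index=0, period=2)
```

In this sketch the yaw override changes only the `pose` field and the blink schedule changes only the `expression` field, so the two controls do not interfere. In a system like the one the abstract describes, an edited sequence of this kind would presumably serve as the conditioning signal for the diffusion model; that step is not shown here.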

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation