IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation

Sejong Yang; Seoung Wug Oh; Yang Zhou; Seon Joo Kim

arXiv:2412.04000 · cs.CV · December 11, 2024

TL;DR

This paper presents IF-MDM, a real-time high-resolution talking head generation model using implicit motion diffusion, which improves visual quality and controllability over existing methods while maintaining fast processing speeds.

Contribution

The paper introduces a novel implicit face motion diffusion model that encodes appearance-aware facial latents and incorporates motion statistics for fine-grained control, enabling high-fidelity real-time video synthesis.

Findings

01 Supports 512x512 videos at 45 fps

02 Outperforms existing diffusion and explicit face models

03 Provides controllable motion quality during inference
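
To put the first finding in concrete terms, the reported throughput can be turned into a per-frame time budget. The short calculation below is only a back-of-the-envelope sketch: it uses the 512x512 resolution and 45 fps figures stated above and assumes frames are produced one at a time with no batching or pipelining, which this summary does not describe.

```python
# Back-of-the-envelope frame budget implied by the reported figures.
# Assumption: frames are generated sequentially (no batching or pipelining).
WIDTH, HEIGHT = 512, 512   # resolution reported in the findings
FPS = 45                   # throughput reported in the findings

frame_budget_ms = 1000.0 / FPS            # time available per frame
pixels_per_frame = WIDTH * HEIGHT         # output size of one frame
pixels_per_second = pixels_per_frame * FPS

print(f"per-frame budget: {frame_budget_ms:.1f} ms")
print(f"pixels per frame: {pixels_per_frame}")
print(f"pixels per second: {pixels_per_second}")
```

Under these assumptions, each frame has roughly 22 ms for the whole generation path, which is the constraint that separates this setting from slower video diffusion approaches mentioned in the abstract.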

Abstract

We introduce a novel approach for high-resolution talking head generation from a single image and audio input. Prior methods using explicit face models, like 3D morphable models (3DMM) and facial landmarks, often fall short in generating high-fidelity videos due to their lack of appearance-aware motion representation. While generative approaches such as video diffusion models achieve high video quality, their slow processing speeds limit practical application. Our proposed model, Implicit Face Motion Diffusion Model (IF-MDM), employs implicit motion to encode human faces into appearance-aware compressed facial latents, enhancing video generation. Although implicit motion lacks the spatial disentanglement of explicit models, which complicates alignment with subtle lip movements, we introduce motion statistics to help capture fine-grained motion information. Additionally, our model…

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing

Methods: Diffusion