Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation

Shuling Zhao; Fa-Ting Hong; Xiaoshui Huang; Dan Xu

arXiv:2412.00719·cs.CV·March 26, 2025

Open Access

TL;DR

This paper introduces a unified multi-scale codebook approach with transformer-based compensation to improve the realism and detail of talking head videos, effectively modeling facial motion and appearance.

Contribution

The paper proposes a framework that jointly learns motion and appearance codebooks and applies multi-scale compensation to improve talking head video generation.

Findings

01 Outperforms state-of-the-art methods in qualitative assessments.

02 Achieves higher accuracy in facial motion and appearance preservation.

03 Produces videos with fewer distortions and more realistic details.

Abstract

Talking head video generation aims to generate a realistic talking head video that preserves the person's identity from a source image and the motion from a driving video. Despite the promising progress made in the field, it remains a challenging and critical problem to generate videos with accurate poses and fine-grained facial details simultaneously. Essentially, facial motion is often highly complex to model precisely, and the one-shot source face image cannot provide sufficient appearance guidance during generation due to dynamic pose changes. To tackle the problem, we propose to jointly learn motion and appearance codebooks and perform multi-scale codebook compensation to effectively refine both the facial motion conditions and appearance features for talking face image decoding. Specifically, the designed multi-scale motion and appearance codebooks are learned simultaneously in a…
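The idea of refining features with a learned codebook at several scales can be illustrated with a small sketch. This is not the authors' architecture: it assumes a single codebook per scale, plain scaled dot-product attention with no learned projections, and random values standing in for learned entries. The names `compensate`, `make_codebook`, and the `coarse`/`fine` scales are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compensate(feature, codebook):
    """Add an attention-weighted mix of codebook entries to one feature vector.

    The feature acts as the query; the codebook entries act as both keys
    and values (an assumption made for brevity).
    """
    dim = len(feature)
    scale = 1.0 / math.sqrt(dim)
    scores = [scale * sum(f * c for f, c in zip(feature, entry))
              for entry in codebook]
    weights = softmax(scores)
    retrieved = [sum(w * entry[i] for w, entry in zip(weights, codebook))
                 for i in range(dim)]
    # Residual connection: the codebook supplies what the feature lacks.
    return [f + r for f, r in zip(feature, retrieved)]


def make_codebook(num_entries, dim, rng):
    """Random stand-in for a learned codebook (hypothetical sizes)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_entries)]


rng = random.Random(0)

# Hypothetical scales: name -> (number of feature vectors, channel dimension).
scales = {"coarse": (4, 8), "fine": (16, 4)}

codebooks = {name: make_codebook(32, dim, rng)
             for name, (_, dim) in scales.items()}
features = {name: [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
            for name, (n, dim) in scales.items()}

# Each scale is refined independently against its own codebook.
refined = {name: [compensate(f, codebooks[name]) for f in feats]
           for name, feats in features.items()}
```

In a setup like the one the abstract describes, separate motion and appearance codebooks would be learned jointly and the lookup would sit inside a trained network; the sketch only shows the query-and-add pattern at more than one resolution.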

Taxonomy

Topics: Face recognition and analysis · Human Motion and Animation · Speech and Audio Processing