ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance

Haijie Yang; Zhenyu Zhang; Hao Tang; Jianjun Qian; Jian Yang

arXiv:2411.15436·cs.CV·November 26, 2024

TL;DR

ConsistentAvatar introduces a novel diffusion-based framework that models temporal features to generate highly consistent and realistic talking head avatars, addressing previous issues of inconsistency and error accumulation.

Contribution

The paper proposes a temporally-sensitive diffusion approach that models and aligns high-frequency temporal features to improve consistency in talking head generation.

Findings

01 Outperforms state-of-the-art methods in appearance and temporal consistency

02 Effectively suppresses error accumulation over video frames

03 Produces high-fidelity, fully consistent talking head avatars

Abstract

Diffusion models have shown impressive potential for talking head generation. While they achieve plausible appearance and talking effects, these methods still suffer from temporal, 3D, or expression inconsistency due to error accumulation and the inherent limitations of single-image generation. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly applying multi-modal conditions to the diffusion process, our method first learns to model a temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency features and contours that vary significantly along the time axis. Using a temporally consistent diffusion module, we learn to align the TSD of the initial result with that of the ground-truth video frame. The…
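To make the idea of "high-frequency detail that varies along the time axis" concrete, here is a minimal illustrative sketch. It is not the paper's TSD construction, which the abstract does not specify. It assumes grayscale frames, uses a simple box-blur residual as a stand-in for high-frequency content, and scores each pixel by how much that residual changes between adjacent frames. The function names (`box_blur`, `high_frequency`, `temporal_detail_map`) and the kernel size `k` are hypothetical.

```python
import numpy as np


def box_blur(img, k=5):
    """Blur a 2-D grayscale image with a k x k box filter (k odd, edge-padded)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    # Sum the k*k shifted copies of the padded image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)


def high_frequency(img, k=5):
    """High-pass residual: the image minus its blurred (low-frequency) version."""
    img = img.astype(np.float64)
    return img - box_blur(img, k)


def temporal_detail_map(frames, k=5):
    """Illustrative stand-in for a temporally sensitive detail map.

    frames: array of shape (T, H, W). Returns an (H, W) map holding the mean
    absolute change of the high-frequency residual between adjacent frames.
    This is an assumption for illustration, not the authors' definition.
    """
    hf = np.stack([high_frequency(f, k) for f in frames])
    return np.abs(np.diff(hf, axis=0)).mean(axis=0)


# Usage: a vertical edge that moves two pixels between two frames.
frames = np.zeros((2, 8, 8))
frames[0, :, :3] = 1.0
frames[1, :, :5] = 1.0
detail = temporal_detail_map(frames)
```

In this toy example the map is nonzero only around the moving edge, which matches the intuition that contours shifting between frames are where temporal inconsistency tends to show up.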

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Social Robot Interaction and HRI

Methods: Diffusion · ALIGN