ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search

Zhenjie Liu; Jianzhang Lu; Renjie Lu; Cong Liang; Shangfei Wang

arXiv:2511.06833 · cs.CV · December 19, 2025

TL;DR

ConsistTalk is a framework for generating temporally consistent, intensity-controllable talking head videos with improved identity preservation and audio-visual synchronization. It combines optical flow guidance, an audio-to-intensity model, and a diffusion noise search strategy.

Contribution

The paper introduces an optical flow-guided temporal module (OFT), an Audio-to-Intensity (A2I) model obtained through knowledge distillation, and a diffusion noise initialization strategy for enhanced video quality.

Findings

1. Reduces flickering and improves temporal consistency.
2. Enhances identity preservation in generated videos.
3. Achieves better audio-visual synchronization and motion control.

Abstract

Recent advancements in video diffusion models have significantly enhanced audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization. These issues primarily stem from entangled appearance-motion representations and unstable inference strategies. In this paper, we introduce ConsistTalk, a novel intensity-controllable and temporally consistent talking head generation framework with diffusion noise search inference. First, we propose an optical flow-guided temporal module (OFT) that decouples motion features from static appearance by leveraging facial optical flow, thereby reducing visual flicker and improving temporal consistency. Second, we present an Audio-to-Intensity (A2I) model obtained through multimodal teacher-student knowledge distillation. By transforming audio and…

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI