ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search
Zhenjie Liu, Jianzhang Lu, Renjie Lu, Cong Liang, Shangfei Wang

TL;DR
ConsistTalk is a novel framework for generating temporally consistent, controllable talking head videos with improved identity preservation and synchronization, utilizing optical flow guidance, an audio-to-intensity model, and a diffusion noise search strategy.
Contribution
It introduces a new optical flow-guided temporal module, an audio-to-intensity model via knowledge distillation, and a diffusion noise initialization strategy for enhanced video quality.
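As a rough illustration of what searching over diffusion noise can mean in general, the sketch below draws several candidate initial noises, runs each through a generator, and keeps the one with the best score. It is a generic best-of-N selection under stated assumptions, not the paper's procedure: `toy_generator`, `temporal_smoothness`, and `search_initial_noise` are hypothetical names, the generator is a stand-in for a real diffusion sampler, and the smoothness score is an assumed criterion that this summary does not confirm.

```python
import random


def toy_generator(noise):
    """Hypothetical stand-in for a diffusion sampler.

    Maps per-frame noise to "frames" via a running mean over time,
    only so that the example runs end to end.
    """
    frames = []
    running = [0.0] * len(noise[0])
    for t, frame_noise in enumerate(noise, start=1):
        running = [r + n for r, n in zip(running, frame_noise)]
        frames.append([r / t for r in running])
    return frames


def temporal_smoothness(frames):
    """Assumed score: negative mean squared change between consecutive frames."""
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for a, b in zip(prev, cur):
            total += (a - b) ** 2
            count += 1
    return -total / count


def search_initial_noise(generate, score, num_frames, frame_dim,
                         num_candidates=8, seed=0):
    """Draw candidate initial noises and keep the highest-scoring one."""
    rng = random.Random(seed)
    best_noise, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise = [[rng.gauss(0.0, 1.0) for _ in range(frame_dim)]
                 for _ in range(num_frames)]
        candidate_score = score(generate(noise))
        if candidate_score > best_score:
            best_noise, best_score = noise, candidate_score
    return best_noise, best_score


best_noise, best_score = search_initial_noise(
    toy_generator, temporal_smoothness, num_frames=6, frame_dim=4
)
```

In a real pipeline the generator would be the video diffusion model and the score would be whatever quality measure the method actually optimizes; both are placeholders here.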
Findings
Reduces flickering and improves temporal consistency.
Enhances identity preservation in generated videos.
Achieves better audio-visual synchronization and motion control.
Abstract
Recent advancements in video diffusion models have significantly enhanced audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization. These issues primarily stem from entangled appearance-motion representations and unstable inference strategies. In this paper, we introduce ConsistTalk, a novel intensity-controllable and temporally consistent talking head generation framework with diffusion noise search inference. First, we propose an optical flow-guided temporal module (OFT) that decouples motion features from static appearance by leveraging facial optical flow, thereby reducing visual flicker and improving temporal consistency. Second, we present an Audio-to-Intensity (A2I) model obtained through multimodal teacher-student knowledge distillation. By transforming audio and…
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI
