AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
Yuxin Lu, Jiayang Sun, Guibo Zhu, Min Cao

TL;DR
AsymTalker is a diffusion-based method for generating long-term, identity-consistent talking head videos. It addresses temporal-spatial misalignment and identity drift through Temporal Reference Encoding and Asymmetric Knowledge Distillation.
Contribution
It introduces Temporal Reference Encoding and Asymmetric Knowledge Distillation to improve long-term coherence and identity preservation in talking head generation.
Findings
Achieves state-of-the-art results on HDTF and VFHQ datasets.
Maintains high-fidelity, identity-consistent video over 600 seconds.
Runs in real time at 66 FPS.
Abstract
Diffusion-based talking head generation has achieved remarkable visual quality, yet scaling it to long-term videos remains challenging. The widely adopted chunk-wise paradigm introduces two fundamental failures: (1) temporal-spatial misalignment between static identity references and dynamic audio streams, and (2) cascading identity drift propagated through self-generated continuity references across chunks. To address both issues, we propose AsymTalker, a novel diffusion-based talking head generation method comprising Temporal Reference Encoding (TRE) and Asymmetric Knowledge Distillation (AKD). First, TRE mitigates temporal-spatial misalignment by transforming the static identity image into a temporally coherent latent representation through encoding of a temporally replicated pseudo-video, without introducing additional parameters. Second, AKD resolves the inherent conditioning…
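The chunk-wise paradigm and the pseudo-video replication step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `plan_chunks`, `replicate_reference`, `chunk_len`, and `num_ref` are hypothetical names chosen for illustration, and the diffusion model and video encoder are omitted entirely. The sketch only shows which previously generated frames each chunk would reuse as a continuity reference (the path along which identity drift can cascade), and how a static identity image could be repeated along time before encoding.

```python
from typing import Any, List, NamedTuple, Optional, Tuple


class Chunk(NamedTuple):
    # [start, end) indices of the frames this chunk generates
    frames: Tuple[int, int]
    # [start, end) indices of earlier generated frames reused as the
    # continuity reference; None for the first chunk
    continuity: Optional[Tuple[int, int]]


def plan_chunks(num_frames: int, chunk_len: int, num_ref: int) -> List[Chunk]:
    """Split a long video into chunks (hypothetical helper).

    Each chunk after the first is conditioned on the last `num_ref`
    frames generated by the previous chunk.
    """
    if num_frames < 0 or chunk_len <= 0 or not 0 <= num_ref <= chunk_len:
        raise ValueError("invalid chunking parameters")
    chunks: List[Chunk] = []
    for start in range(0, num_frames, chunk_len):
        end = min(start + chunk_len, num_frames)
        # Self-generated frames feed the next chunk, so any error in them
        # propagates forward.
        continuity = (start - num_ref, start) if start > 0 and num_ref > 0 else None
        chunks.append(Chunk((start, end), continuity))
    return chunks


def replicate_reference(image: Any, num_frames: int) -> List[Any]:
    """Build a pseudo-video by repeating one static image along time.

    Assumption: the result would then be passed through an existing video
    encoder, which is why no new parameters are needed. The encoder itself
    is not modeled here.
    """
    return [image for _ in range(num_frames)]


if __name__ == "__main__":
    for chunk in plan_chunks(num_frames=10, chunk_len=4, num_ref=2):
        print(chunk)
    print(len(replicate_reference("identity_image", 4)))
```

In this sketch the identity image stays fixed for every chunk while the continuity reference changes from chunk to chunk, which mirrors the two conditioning signals the abstract distinguishes.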
