EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence
Yahui Li, Yinfeng Yu, Liejun Wang, Shengjie Shen

TL;DR
EAD-Net is an emotion-aware, diffusion-based talking head generation framework that improves lip-sync, emotional expression, and temporal coherence in long videos through spatio-temporal modeling and semantic guidance.
Contribution
The paper introduces a diffusion model-based network that combines SyncNet supervision, Temporal Representation Alignment (TREPA), STDA, TFRM, and semantic guidance to generate talking head videos that are more expressive, more temporally coherent, and more accurately lip-synced.
Findings
Outperforms existing methods in lip-sync accuracy.
Achieves better temporal consistency in long videos.
Enhances emotional expression accuracy.
Abstract
Emotional talking head video generation aims to produce expressive portrait videos with accurate lip synchronization and emotional facial expressions. Current methods rely on simple emotion labels, which carry insufficient semantic information. Introducing high-level semantics enhances expressiveness, but it easily degrades lip-sync. Furthermore, mainstream generation methods struggle to balance computational efficiency and global motion awareness in long videos and suffer from poor temporal coherence. Therefore, we propose an Emotion-Aware Diffusion model-based Network, called EAD-Net. We introduce SyncNet supervision and Temporal Representation Alignment (TREPA) to mitigate lip-sync degradation caused by multi-modal fusion. To model complex spatio-temporal dependencies in long video sequences, we propose a Spatio-Temporal…
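The abstract names SyncNet supervision but this excerpt does not give its formulation. As an illustration only, the sketch below shows a SyncNet-style lip-sync loss as it is commonly set up in lip-sync work: the cosine similarity between a video embedding and an audio embedding is treated as a probability of being in sync and scored with binary cross-entropy. The function names, the clamping choice, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def sync_loss(video_emb, audio_emb, in_sync=True, eps=1e-7):
    """Binary cross-entropy on the audio-video cosine similarity.

    Hypothetical helper: a real system would obtain both embeddings from a
    pretrained sync network; here they are plain lists of floats.
    """
    sim = cosine_similarity(video_emb, audio_emb)
    # Assumption: clamp the similarity into (0, 1) so the logarithm is defined.
    # Negative similarities are treated as "almost certainly out of sync".
    p = min(max(sim, eps), 1.0 - eps)
    target = 1.0 if in_sync else 0.0
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Toy embeddings: identical directions vs. orthogonal directions.
aligned = sync_loss([1.0, 0.0], [1.0, 0.0])
misaligned = sync_loss([1.0, 0.0], [0.0, 1.0])
print(aligned < misaligned)
```

In this toy setup the loss is near zero when the two embeddings point in the same direction and large when they are orthogonal, which is the signal a generator would be penalized with when its mouth motion drifts away from the audio.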
