ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation
Xiangyu Kong, Xiaoyu Jin, Yihan Pan, Haoqin Sun, Hengde Zhu, Xiaoming Xu, Xiaoming Wei, Lu Liu, Siyang Song

TL;DR
ECHO is an interactive head generation framework that models long-range context and emotional appropriateness, making avatar facial behaviors more lifelike and improving lip synchronization in simulated face-to-face interaction.
Contribution
The paper proposes ECHO, which combines long-range contextual understanding with a decoupled cross-attention module to improve contextual appropriateness and lip-sync accuracy in avatar head generation.
Findings
ECHO outperforms existing methods in contextual appropriateness.
ECHO achieves superior lip synchronization and visual fidelity.
Extensive experiments validate the effectiveness of ECHO's components.
Abstract
In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head videos emulating such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., the human user's behaviors and pre-defined audio for the avatar) within a short temporal window, jointly driving generation of the avatar's audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals…
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
