ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation

Xiangyu Kong; Xiaoyu Jin; Yihan Pan; Haoqin Sun; Hengde Zhu; Xiaoming Xu; Xiaoming Wei; Lu Liu; Siyang Song

arXiv:2603.17427 · cs.CV · March 19, 2026

Open Access

TL;DR

ECHO introduces a novel interactive head generation framework that models long-range context and emotional appropriateness, improving lifelike avatar facial behaviors and lip synchronization in face-to-face interaction simulations.

Contribution

The paper proposes ECHO, featuring long-range contextual understanding and a decoupled cross-attention module, to enhance contextual appropriateness and lip-sync accuracy in avatar head generation.
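
This page does not describe the internals of the decoupled cross-attention module. As a rough illustration only, the sketch below shows one generic reading of "decoupled" cross-attention: the avatar's features attend to the audio track and to the user-behavior track separately, and the two results are summed. All names here (`attend`, `decoupled_cross_attention`, `audio_tokens`, `user_tokens`, `user_scale`) are hypothetical, learned projections are omitted, and none of this should be read as ECHO's actual design.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def decoupled_cross_attention(queries, audio_tokens, user_tokens, user_scale=1.0):
    """Attend to each conditioning track on its own, then sum the results.

    Assumption: tokens serve as both keys and values (no learned projections),
    and `user_scale` is a hypothetical knob weighting the user-behavior branch.
    """
    out = []
    for q in queries:
        from_audio = attend(q, audio_tokens, audio_tokens)  # avatar audio branch
        from_user = attend(q, user_tokens, user_tokens)     # user behavior branch
        out.append([a + user_scale * u for a, u in zip(from_audio, from_user)])
    return out


if __name__ == "__main__":
    queries = [[1.0, 0.0]]
    audio_tokens = [[1.0, 2.0]]
    user_tokens = [[3.0, 4.0]]
    print(decoupled_cross_attention(queries, audio_tokens, user_tokens))
```

Keeping the two branches separate means each track gets its own attention weights, in contrast to concatenating both tracks into one key/value sequence where they compete inside a single softmax.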

Findings

01 ECHO outperforms existing methods in contextual appropriateness.

02 ECHO achieves superior lip synchronization and visual fidelity.

03 Extensive experiments validate the effectiveness of ECHO's components.

Abstract

In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head video emulating such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., human user's behaviors and pre-defined audio for avatar) within a short temporal window, jointly driving generation of avatar's audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals…

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing