Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

Yuzhe Weng; Haotian Wang; Xinyi Yu; Xiaoyan Wu; Haoran Xu; Shan He; Jun Du

arXiv:2604.10367·cs.AI·April 14, 2026

TL;DR

This paper presents a full-duplex interactive avatar generation model that processes dual audio streams. It uses a multi-head Gaussian kernel to account for the different temporal scales of talking and listening behavior, aiming for more natural, human-like interaction.

Contribution

The paper introduces a multi-head Gaussian temporal kernel and a dual-stream audio processing framework for more realistic and responsive conversational virtual agents.

Findings

01. Achieves state-of-the-art naturalness in full-duplex avatar interactions.

02. Successfully incorporates long-range conversational context without lip-sync degradation.

03. Introduces a new dataset, VoxHear, with decoupled speech and background audio.

Abstract

Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advancements in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios. However, relying on strict frame-to-frame alignment renders the model's response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the unique temporal Scale Discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition…
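
As a rough illustration of the scale-discrepancy idea described above, the sketch below builds Gaussian temporal weights with a different width per head. It is not the authors' implementation: the excerpt does not give the kernel's exact form, so the function names (`gaussian_kernel_weights`, `apply_kernel`), the per-head `sigmas`, the row normalization, and the use of scalar per-frame audio features are all assumptions made for illustration. A narrow head behaves like near frame-to-frame alignment (the regime that matters for lip sync), while a wide head lets a frame draw on distant audio context (the regime that matters for listening reactions).

```python
import math


def gaussian_kernel_weights(num_frames, sigmas):
    """Build weights[h][t][s]: how strongly video frame t draws on audio frame s
    under head h. Each head uses its own (hypothetical) standard deviation."""
    weights = []
    for sigma in sigmas:
        head = []
        for t in range(num_frames):
            # Unnormalized Gaussian over the temporal offset (t - s).
            row = [
                math.exp(-((t - s) ** 2) / (2.0 * sigma ** 2))
                for s in range(num_frames)
            ]
            total = sum(row)
            # Normalize so each frame's weights sum to 1.
            head.append([w / total for w in row])
        weights.append(head)
    return weights


def apply_kernel(head_weights, audio_features):
    """Aggregate scalar per-frame audio features with one head's weights."""
    return [
        sum(w * a for w, a in zip(row, audio_features))
        for row in head_weights
    ]


# Assumed widths: one narrow head (local alignment), one wide head (long context).
weights = gaussian_kernel_weights(num_frames=8, sigmas=[0.5, 4.0])

# A single audio event at frame 3.
audio = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
narrow_response = apply_kernel(weights[0], audio)
wide_response = apply_kernel(weights[1], audio)
```

With these assumed widths, the narrow head keeps the response concentrated at frame 3, whereas the wide head spreads a noticeable share of it to distant frames such as frame 0. Combining heads of different widths is one way to keep tight local alignment while still exposing longer-range context.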

Code & Models

Repositories

https://warmcongee.github.io/beyond-monologue
