TL;DR
This paper presents a novel full-duplex interactive avatar generation model that processes dual audio streams, leveraging a multi-head Gaussian kernel to improve temporal dynamics and achieve more natural human-like interactions.
Contribution
It introduces a new temporal kernel and a dual-stream audio processing framework for more realistic and responsive conversational virtual agents.
Findings
Achieves state-of-the-art naturalness in full-duplex avatar interactions.
Successfully incorporates long-range conversational context without lip-sync degradation.
Introduces a new dataset, VoxHear, with decoupled speech and background audio.
Abstract
Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advancements in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios. However, relying on strict frame-to-frame alignment makes the model's response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the distinct temporal scale discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition…
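The abstract is truncated before it defines the kernel, so the following is only an illustrative sketch of one common way a multi-head Gaussian temporal prior can be applied to attention, not the paper's formulation. It assumes an additive log-space bias of the form -(i - j)^2 / (2 * sigma_h^2) with one width sigma_h per head. The function names, the sigma values, and the frame count are all hypothetical. A narrow head stays close to frame-to-frame alignment (the regime lip sync needs), while a wide head spreads over long-range context (the regime listening reactions need).

```python
import numpy as np

def gaussian_attention_bias(num_frames, sigmas):
    """Additive attention bias of shape (heads, frames, frames).

    Assumed form (not taken from the paper): -(i - j)^2 / (2 * sigma_h^2),
    where sigma_h is the temporal width assigned to head h.
    """
    idx = np.arange(num_frames)
    dist2 = (idx[:, None] - idx[None, :]) ** 2            # squared frame distance, (T, T)
    sigmas = np.asarray(sigmas, dtype=np.float64)
    return -dist2[None, :, :] / (2.0 * sigmas[:, None, None] ** 2)

def biased_attention_weights(scores, bias):
    """Softmax over the last axis of (scores + bias), computed stably."""
    logits = scores + bias
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

# Hypothetical setup: 9 frames, one narrow head and one wide head.
T = 9
sigmas = [0.5, 8.0]
bias = gaussian_attention_bias(T, sigmas)

# Uniform content scores isolate the effect of the temporal prior.
scores = np.zeros((len(sigmas), T, T))
weights = biased_attention_weights(scores, bias)

center = T // 2
print("narrow head, center row:", np.round(weights[0, center], 3))
print("wide head,   center row:", np.round(weights[1, center], 3))
```

With uniform scores, the narrow head concentrates most of its weight on the aligned frame, while the wide head distributes weight almost evenly across the window. In a real model the widths could be fixed per head or learned; which of these the paper uses is not stated in the text above.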
