INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
Yongming Zhu, Longhao Zhang, Zhengkun Rong, Tianshu Hu, Shuang Liang, Zhipeng Ge

TL;DR
This paper introduces INFP, a novel framework for audio-driven interactive head generation in dyadic conversations, enabling realistic, dynamic agent behaviors that switch between speaking and listening based on audio cues.
Contribution
We propose INFP, a new head generation model that dynamically alternates between speaking and listening states guided by dyadic audio, and introduce DyConv, a large-scale conversational dataset.
Findings
Outperforms existing head generation methods in realism and interactivity.
Effectively models dynamic role switching in dyadic conversations.
Demonstrates superior performance through extensive experiments.
Abstract
Imagine having a conversation with a socially intelligent agent. It can attentively listen to your words and offer visual and linguistic feedback promptly. This seamless interaction allows for multiple rounds of conversation to flow smoothly and naturally. In pursuit of actualizing it, we propose INFP, a novel audio-driven head generation framework for dyadic interaction. Unlike previous head generation works that only focus on single-sided communication, or require manual role assignment and explicit role switching, our model drives the agent portrait to dynamically alternate between speaking and listening states, guided by the input dyadic audio. Specifically, INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. The first stage learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion…
Taxonomy
Topics: Speech and dialogue systems · Human Motion and Animation · Language, Metaphor, and Cognition
