INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations

Yongming Zhu; Longhao Zhang; Zhengkun Rong; Tianshu Hu; Shuang Liang; Zhipeng Ge

arXiv:2412.04037·cs.CV·December 6, 2024

TL;DR

This paper introduces INFP, a framework for audio-driven interactive head generation in dyadic conversations, in which the agent portrait switches between speaking and listening according to the input dyadic audio.

Contribution

We propose INFP, a new head generation model that dynamically alternates between speaking and listening states guided by dyadic audio, and introduce DyConv, a large-scale conversational dataset.

Findings

01. Outperforms existing head generation methods in realism and interactivity.

02. Effectively models dynamic role switching in dyadic conversations.

03. Demonstrates superior performance through extensive experiments.

Abstract

Imagine having a conversation with a socially intelligent agent. It can listen attentively to your words and promptly offer visual and linguistic feedback. This seamless interaction allows multiple rounds of conversation to flow smoothly and naturally. To realize this, we propose INFP, a novel audio-driven head generation framework for dyadic interaction. Unlike previous head generation works, which either focus only on single-sided communication or require manual role assignment and explicit role switching, our model drives the agent portrait to alternate dynamically between speaking and listening states, guided by the input dyadic audio. Specifically, INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. The first stage learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion…

Taxonomy

Topics: Speech and dialogue systems · Human Motion and Animation · Language, Metaphor, and Cognition
