MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model

Jin Liu; Xi Wang; Xiaomeng Fu; Yesheng Chai; Cai Yu; Jiao Dai; Jizhong Han

arXiv:2308.16635·cs.CV·September 1, 2023

TL;DR

This paper introduces MFR-Net, a diffusion model-based system for generating responsive listening head videos that reflect speaker attitude, viewpoint, and listener identity, addressing a largely overlooked aspect of face-to-face communication modeling.

Contribution

MFR-Net is the first to generate diverse, multi-faceted listening head videos that respond to speaker cues while preserving listener identity using a diffusion model.

Findings

01. Achieves diverse responses in attitude and viewpoint.

02. Maintains high accuracy in listener identity.

03. Outperforms existing methods in response diversity and identity preservation.

Abstract

Face-to-face communication is a common scenario involving the roles of speaker and listener. Most existing research focuses on producing speaker videos, while the generation of listener heads remains largely overlooked. Responsive listening head generation is an important task that aims to model face-to-face communication by generating a listener head video given a speaker video and a listener head image. An ideal generated responsive listening video should respond to the speaker by expressing an attitude or viewpoint, while maintaining diversity in interaction patterns and accuracy in listener identity information. To achieve this goal, we propose the Multi-Faceted Responsive Listening Head Generation Network (MFR-Net). Specifically, MFR-Net employs a probabilistic denoising diffusion model to predict diverse head pose and expression features.…
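
The abstract does not spell out MFR-Net's diffusion formulation, so the following is only a minimal sketch of the generic forward (noising) step used by standard denoising diffusion models, applied to a made-up feature vector. The linear noise schedule, the 1000-step horizon, the 6-dimensional "pose and expression" vector, and all function names are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * v + sigma * e for v, e in zip(x0, noise)]
    return xt, noise


# Hypothetical 6-dim head pose / expression feature vector (values are made up).
rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
x0 = [0.1, -0.2, 0.05, 0.3, 0.0, -0.1]
xt, noise = add_noise(x0, 500, alpha_bars, rng)
```

In a typical setup of this kind, a network is trained to recover `noise` (or `x0`) from `xt` and the step index. Sampling with different random seeds then yields different feature sequences, which is the usual source of diversity in diffusion-based generation.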

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis

Methods: Focus · Diffusion