MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model
Jin Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han

TL;DR
This paper introduces MFR-Net, a diffusion model-based system for generating responsive listening head videos that reflect speaker attitude, viewpoint, and listener identity, addressing a largely overlooked aspect of face-to-face communication modeling.
Contribution
MFR-Net is the first to generate diverse, multi-faceted listening head videos that respond to speaker cues while preserving listener identity using a diffusion model.
Findings
Achieves diverse responses in attitude and viewpoint.
Maintains high accuracy in listener identity.
Outperforms existing methods in response diversity and identity preservation.
Abstract
Face-to-face communication is a common scenario that involves both speakers and listeners. Most existing research methods focus on producing speaker videos, while the generation of listener heads remains largely overlooked. Responsive listening head generation is an important task that aims to model face-to-face communication scenarios by generating a listener head video given a speaker video and a listener head image. An ideal generated responsive listening video should respond to the speaker by expressing an attitude or viewpoint while maintaining diversity in interaction patterns and accuracy in listener identity information. To achieve this goal, we propose the Multi-Faceted Responsive Listening Head Generation Network (MFR-Net). Specifically, MFR-Net employs the probabilistic denoising diffusion model to predict diverse head pose and expression features.…
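The abstract says only that a probabilistic denoising diffusion model predicts head pose and expression features; it does not give the network or the conditioning details. As a rough illustration of the generic technique, and not of MFR-Net's actual implementation, the sketch below shows the standard DDPM forward (noising) step applied to a small feature vector. The linear beta schedule, the four-dimensional `pose_expr` vector, and all function names are assumptions made for this example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


def recover_x0(xt, noise, t, alpha_bars):
    """Invert q_sample given the exact noise; a trained denoiser would estimate the noise instead."""
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * n) / math.sqrt(a) for x, n in zip(xt, noise)]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# Hypothetical stand-in for one frame of head pose and expression coefficients.
pose_expr = [0.1, -0.3, 0.25, 0.0]

xt, eps = q_sample(pose_expr, 500, alpha_bars, rng)
x0_hat = recover_x0(xt, eps, 500, alpha_bars)
```

In a trained model the noise `eps` is not known at sampling time: a network predicts it from `xt`, the timestep, and conditioning signals, and different random draws lead to different outputs. That stochastic sampling is the general property of diffusion models that makes them a natural fit for generating diverse responses.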
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis
Methods: Focus · Diffusion
