ReactDiff: Latent Diffusion for Facial Reaction Generation

Jiaming Li; Sheng Wang; Xin Wang; Yitao Zhu; Honglin Xiong; Zixu Zhuang; Qian Wang

arXiv:2505.14151 · cs.CV · June 5, 2025

TL;DR

ReactDiff introduces a novel latent diffusion framework with multi-modal transformer integration for generating diverse, contextually appropriate facial reactions from audio-visual input, outperforming prior methods.

Contribution

The paper presents ReactDiff, a new diffusion-based model that combines multi-modal transformers and latent diffusion for improved facial reaction generation.

Findings

01 Achieves a facial reaction correlation of 0.26
02 Attains a diversity score of 0.094
03 Maintains competitive realism in generated reactions

Abstract

Given the audio-visual clip of the speaker, facial reaction generation aims to predict the listener's facial reactions. The challenge lies in capturing the relevance between video and audio while balancing appropriateness, realism, and diversity. While prior works have mostly focused on uni-modal inputs or simplified reaction mappings, recent approaches such as PerFRDiff have explored multi-modal inputs and the one-to-many nature of appropriate reaction mappings. In this work, we propose the Facial Reaction Diffusion (ReactDiff) framework that uniquely integrates a Multi-Modality Transformer with conditional diffusion in the latent space for enhanced reaction generation. Unlike existing methods, ReactDiff leverages intra- and inter-class attention for fine-grained multi-modal interaction, while the latent diffusion process between the encoder and decoder enables diverse yet contextually…
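The abstract describes conditional diffusion in a latent space that sits between an encoder and a decoder. As a rough illustration of that general idea only, the sketch below shows the standard DDPM forward noising step and a noise-prediction loss on a latent vector, conditioned on speaker features. It is not the authors' implementation: the linear noise schedule, the function names (`make_alpha_bar`, `q_sample`, `toy_denoiser`, `training_loss`), the tensor shapes, and the linear "denoiser" are all assumptions made for this example, and the toy denoiser ignores the timestep, which a real network would embed.

```python
import numpy as np


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(z0, t, alpha_bar, rng):
    """Forward noising: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    noise = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return zt, noise


def toy_denoiser(zt, cond, W_z, W_c):
    """Hypothetical stand-in for a learned network: predicts the noise from
    the noisy latent and the conditioning features with two linear maps."""
    return zt @ W_z + cond @ W_c


def training_loss(z0, cond, t, alpha_bar, W_z, W_c, rng):
    """Mean squared error between predicted and true noise at step t."""
    zt, noise = q_sample(z0, t, alpha_bar, rng)
    pred = toy_denoiser(zt, cond, W_z, W_c)
    return float(np.mean((pred - noise) ** 2))


# Illustrative shapes only: 4 samples, 16-dim reaction latents,
# 32-dim fused audio-visual speaker features.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
z0 = rng.standard_normal((4, 16))      # latent of the listener reaction
cond = rng.standard_normal((4, 32))    # speaker conditioning features
W_z = np.zeros((16, 16))               # untrained toy weights
W_c = np.zeros((32, 16))
loss = training_loss(z0, cond, 500, alpha_bar, W_z, W_c, rng)
print(f"noise-prediction loss at t=500: {loss:.3f}")
```

In a setup like this, diversity would come from sampling different noise at generation time while the conditioning features stay fixed, which is one way to read the one-to-many mapping the abstract mentions.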

Code & Models

Repositories

hunan-tiger/reactdiff (PyTorch, official)

Taxonomy

Topics: Emotion and Mood Recognition · Face recognition and analysis · Speech and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Diffusion