Multimodal Emotion Recognition via Causal-Diffusion Bridge (Affect-Diff)

Ankit Sanjyal

arXiv:2605.08252·cs.CV·May 12, 2026

TL;DR

Affect-Diff is a multimodal emotion recognition model that combines causal-graph re-weighting of modalities, a regularized latent bottleneck, and a diffusion prior over the latent space to improve minority-emotion detection on CMU-MOSEI.

Contribution

The paper introduces a Causal-Diffusion Bridge: three jointly trained mechanisms (a NOTEARS-learned causal graph, a beta-VAE bottleneck, and a stop-gradiented 1D DDPM prior) that target class imbalance in multimodal emotion recognition.

Findings

01

Achieves a validation balanced accuracy of 0.384, an 18% relative improvement over the strongest baseline (TETFN, 0.324).

02

Detects all six emotion classes with the deterministic-encoder variant.

03

Ablation shows diffusion prior and causal graph are independently crucial.
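
The findings above are reported in balanced accuracy rather than plain accuracy. The sketch below illustrates why that matters under heavy class imbalance, and reproduces the relative-improvement arithmetic from the reported scores. The label counts are hypothetical toy numbers chosen only to resemble a Happy-dominated split; they are not the CMU-MOSEI distribution, and the helper names are illustrative rather than taken from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes that occur in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# Hypothetical toy split: 100 samples, six classes, dominated by "happy".
labels = (
    ["happy"] * 66 + ["sad"] * 16 + ["angry"] * 12
    + ["fear"] * 2 + ["disgust"] * 2 + ["surprise"] * 2
)

# A degenerate model that always predicts the majority class.
always_happy = ["happy"] * len(labels)

plain_acc = sum(t == p for t, p in zip(labels, always_happy)) / len(labels)
bal_acc = balanced_accuracy(labels, always_happy)

# Scores quoted in the abstract: Affect-Diff 0.384 vs. TETFN 0.324.
gain = relative_improvement(0.384, 0.324)

print(f"plain accuracy:    {plain_acc:.3f}")
print(f"balanced accuracy: {bal_acc:.3f}")
print(f"relative gain:     {gain:.1%}")
```

On this toy split the always-Happy predictor scores high on plain accuracy but only one sixth on balanced accuracy, since it recalls a single class out of six. That gap is the reason a model can look strong on accuracy while producing zero F1 on minority emotions.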

Abstract

Multimodal emotion recognition on CMU-MOSEI faces extreme class imbalance: Happy accounts for 65.9% of samples, while three Ekman categories together represent under 7%, so standard fusion models maximize accuracy by ignoring minority emotions entirely. We present Affect-Diff, a Causal-Diffusion Bridge that addresses this through three jointly trained mechanisms: a NOTEARS-learned causal graph that re-weights modality contributions before fusion, a beta-VAE bottleneck for regularized latent compression, and a stop-gradiented 1D DDPM prior that structures the latent space against majority-class collapse. On 3,292 aligned CMU-MOSEI samples, Affect-Diff achieves a validation balanced accuracy of 0.384, an 18% relative improvement over the strongest baseline (TETFN: 0.324), while all evaluated baselines produce zero F1 on Fear, Disgust, and Surprise. Ablation studies confirm…
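
For orientation, the three mechanisms named in the abstract each have a well-known textbook form, sketched below. These are the standard formulations from the NOTEARS, beta-VAE, and DDPM literature, not expressions taken from the paper; the symbols, the reading of "stop-gradiented" as a stop-gradient operator sg(·) on the latent fed to the diffusion prior, and the way the terms are combined during joint training are all assumptions.

```latex
% Standard forms only; the paper's exact losses, weights, and notation may differ.

% NOTEARS: acyclicity constraint on a weighted adjacency matrix W in R^{d x d},
% where \circ is the elementwise (Hadamard) product.
h(W) = \operatorname{tr}\!\left(e^{W \circ W}\right) - d = 0

% beta-VAE: reconstruction term plus a beta-weighted KL penalty on the latent z.
\mathcal{L}_{\beta\text{-VAE}}
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  + \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)

% DDPM prior on the latent, with z_0 = \mathrm{sg}(z) (assumed stop-gradient).
q(z_t \mid z_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, z_0,\; (1 - \bar{\alpha}_t)\, I\right)

\mathcal{L}_{\text{DDPM}}
  = \mathbb{E}_{t,\, \epsilon \sim \mathcal{N}(0, I)}
    \left[\left\lVert \epsilon - \epsilon_\psi(z_t, t) \right\rVert^2\right]
```

Here h(W) = 0 holds exactly when the graph encoded by W has no directed cycles, beta controls how strongly the latent is pulled toward the prior p(z), and the noise-prediction loss trains the denoiser on progressively noised latents.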
