AffectVerse: Emotional World Models for Multimodal Affective Computing
Bo Zhao, Fanghua Ye, Yixin Ji, Sicheng Zhao, Xiaojiang Peng, Zitong Yu

TL;DR
AffectVerse is a multimodal affective computing model that predicts future emotional states using a novel belief module, improving emotion recognition across multiple benchmarks.
Contribution
The paper introduces AffectVerse, whose Emotion World Module models affective dynamics through future prediction, an approach the authors present as new to multimodal emotion recognition.
Findings
AffectVerse outperforms existing models by at least 2.57% on nine benchmarks.
Temporal imagination and belief aggregation contribute to performance gains.
Predictive belief-state modeling is effective for affective computing.
Abstract
Humans infer emotions by integrating observed multimodal cues with expectations about how affective states may unfold. Existing multimodal large language models (MLLMs), however, often treat emotion recognition as static fusion over complete audiovisual-text inputs, leaving affective dynamics implicit. We propose AffectVerse, a Qwen2.5-Omni-based model equipped with an Emotion World Module (EWM), an action-free representation-level module for short-horizon latent affective prediction. EWM contains three modules: 1) Cross-Modal Temporal Imagination predicts future video/audio representations from past tokens with multi-step rollout. 2) MAMA (Modality-Aware Multi-step Attention) Belief Aggregation compresses imagined tokens into modality-aware belief tokens. 3) Belief Injection inserts these belief tokens into the LLM for affective reasoning. AffectVerse uses future prediction as a…
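The three EWM stages named in the abstract (multi-step rollout, attention-based belief aggregation, belief injection) can be illustrated with a toy data-flow sketch. This is not the authors' implementation: the abstract gives no architectural details, so every name below (`rollout`, `mean_step`, `attention_pool`, `build_belief_tokens`, `inject`) is hypothetical, the learned predictor is replaced by a simple moving average, the attention queries are fixed vectors rather than learned parameters, and prepending belief tokens to the LLM input is an assumed injection position.

```python
import math

def rollout(past, step_fn, horizon):
    """Autoregressive multi-step rollout: each predicted token is fed back as context."""
    context = list(past)
    imagined = []
    for _ in range(horizon):
        nxt = step_fn(context)
        imagined.append(nxt)
        context.append(nxt)
    return imagined

def mean_step(context, window=2):
    """Hypothetical stand-in for a learned predictor: mean of the last `window` tokens."""
    recent = context[-window:]
    dim = len(recent[0])
    return [sum(tok[d] for tok in recent) / len(recent) for d in range(dim)]

def attention_pool(tokens, query):
    """Compress tokens into one vector with single-query scaled dot-product attention."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

def build_belief_tokens(past_by_modality, queries, horizon=3):
    """One belief token per modality: imagine future tokens, then pool them."""
    beliefs = []
    for name, past in past_by_modality.items():
        imagined = rollout(past, mean_step, horizon)
        beliefs.append(attention_pool(imagined, queries[name]))
    return beliefs

def inject(belief_tokens, llm_tokens):
    """Prepend belief tokens to the LLM input sequence (position is an assumption)."""
    return belief_tokens + llm_tokens

# Toy 2-dimensional "tokens" for two modalities (made-up values).
past = {
    "video": [[0.0, 1.0], [1.0, 0.0]],
    "audio": [[0.5, 0.5], [0.5, 0.5]],
}
queries = {"video": [0.0, 0.0], "audio": [1.0, 0.0]}  # hypothetical per-modality queries

beliefs = build_belief_tokens(past, queries, horizon=3)
sequence = inject(beliefs, [[0.1, 0.2]])
```

Keeping one query per modality is what makes the pooled tokens "modality-aware" in this sketch. In a real system the predictor and queries would be trained, and the belief tokens would live in the LLM's embedding space rather than in a toy 2-dimensional one.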
