AffectVerse: Emotional World Models for Multimodal Affective Computing

Bo Zhao; Fanghua Ye; Yixin Ji; Sicheng Zhao; Xiaojiang Peng; Zitong YU

arXiv:2605.19950·cs.CV·May 20, 2026

TL;DR

AffectVerse is a multimodal affective computing model that predicts future emotional states using a novel belief module, improving emotion recognition across multiple benchmarks.

Contribution

The paper introduces AffectVerse, whose Emotion World Module (EWM) models affective dynamics through future prediction, an approach the authors present as new to multimodal emotion recognition.

Findings

01. AffectVerse outperforms existing models by at least 2.57% on nine benchmarks.

02. Temporal imagination and belief aggregation contribute to performance gains.

03. Predictive belief-state modeling is effective for affective computing.

Abstract

Humans infer emotions by integrating observed multimodal cues with expectations about how affective states may unfold. Existing multimodal large language models (MLLMs), however, often treat emotion recognition as static fusion over complete audiovisual-text inputs, leaving affective dynamics implicit. We propose AffectVerse, a Qwen2.5-Omni-based model equipped with an Emotion World Module (EWM), an action-free representation-level module for short-horizon latent affective prediction. EWM contains three modules: 1) Cross-Modal Temporal Imagination predicts future video/audio representations from past tokens with multi-step rollout. 2) MAMA (Modality-Aware Multi-step Attention) Belief Aggregation compresses imagined tokens into modality-aware belief tokens. 3) Belief Injection inserts these belief tokens into the LLM for affective reasoning. AffectVerse uses future prediction as a…
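
The three EWM stages named in the abstract can be pictured with a toy NumPy sketch. This is only an illustration of the data flow (rollout, attention pooling, injection), not the authors' implementation: the function names `rollout`, `aggregate_beliefs`, and `inject`, the tanh-linear predictor, the random weights, the token counts, and the choice to prepend belief tokens are all hypothetical assumptions, since the abstract does not specify them.

```python
import numpy as np

def rollout(past, W, steps):
    """Multi-step rollout: predict `steps` future tokens autoregressively.
    Assumed toy predictor: next token = tanh(mean(context) @ W)."""
    context = past.copy()
    imagined = []
    for _ in range(steps):
        nxt = np.tanh(context.mean(axis=0) @ W)
        imagined.append(nxt)
        # Feed the prediction back in so later steps condition on it.
        context = np.vstack([context, nxt])
    return np.stack(imagined)  # shape: (steps, d)

def aggregate_beliefs(imagined, queries):
    """Attention pooling: each query attends over the imagined tokens
    and yields one belief token (a softmax-weighted average)."""
    scores = queries @ imagined.T / np.sqrt(imagined.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ imagined  # shape: (num_queries, d)

def inject(llm_tokens, *belief_groups):
    """Assumed injection scheme: prepend belief tokens to the LLM input."""
    return np.vstack([*belief_groups, llm_tokens])

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden size
video_past = rng.normal(size=(8, d))
audio_past = rng.normal(size=(6, d))
text_tokens = rng.normal(size=(10, d))

# "Cross-modal" is modeled here (an assumption) by letting each
# modality's predictor see the past tokens of both modalities.
context = np.vstack([video_past, audio_past])
W_video = rng.normal(size=(d, d)) * 0.1
W_audio = rng.normal(size=(d, d)) * 0.1
video_future = rollout(context, W_video, steps=3)
audio_future = rollout(context, W_audio, steps=3)

# "Modality-aware" is modeled here by a separate query set per modality.
video_beliefs = aggregate_beliefs(video_future, rng.normal(size=(2, d)))
audio_beliefs = aggregate_beliefs(audio_future, rng.normal(size=(2, d)))

llm_input = inject(text_tokens, video_beliefs, audio_beliefs)
print(llm_input.shape)
```

In this sketch the compression step is what keeps the LLM input short: however many steps are imagined, each modality contributes only a fixed number of belief tokens (two here, an arbitrary choice).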
