Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights

Juanxi Tian; Siyuan Li; Conghui He; Lijun Wu; Cheng Tan

arXiv:2512.01816·cs.CV·December 2, 2025

TL;DR

Envision is a benchmark with an accompanying metric for evaluating whether multimodal models can generate and understand dynamic, causally structured visual narratives that unfold over time, rather than relying on static pattern matching.

Contribution

The paper proposes Envision, a causal event progression benchmark with a new holistic metric that evaluates how well models capture spatiotemporal causality in multi-image generation.

Findings

01. Unified models outperform specialized T2I models in causal coherence

02. Specialized T2I models excel in aesthetics but lack world knowledge

03. Even advanced models struggle with spatiotemporal consistency

Abstract

Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly…

Code & Models

Datasets

OpenRaiser/Envision
dataset · 118 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Face Recognition and Perception