Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
Juanxi Tian, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan

TL;DR
Envision introduces a benchmark and metric for evaluating how well multimodal models generate and understand dynamic, causally structured visual narratives over time, addressing the limitations of static pattern matching.
Contribution
The paper proposes Envision, a causal event progression benchmark with a new holistic metric for evaluating models' understanding of spatiotemporal causality in multi-image generation.
Findings
Unified models outperform specialized T2I models in causal coherence
Specialized T2I models excel in aesthetics but lack world knowledge
Even advanced models struggle with spatiotemporal consistency
Abstract
Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Face Recognition and Perception
