Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Keming Wu; Zuhao Yang; Kaichen Zhang; Shizun Wang; Haowei Zhu; Sicong Leng; Zhongyu Yang; Qijie Wang; Sudong Wang; Ziting Wang; Zili Wang; Hui Zhang; Haonan Wang; Hang Zhou; Yifan Pu; Xingxuan Li; Fangneng Zhan; Bo Li; Lidong Bing; Yuxin Song; Ziwei Liu; Wenhu Chen; Jingdong Wang; Xinchao Wang; Xiaojuan Qi; Shijian Lu; Bin Wang

arXiv:2604.28185 · cs.CV · May 1, 2026

TL;DR

This paper advocates shifting visual generation research from appearance-focused synthesis to models that incorporate structure, dynamics, and causal understanding, proposing a new taxonomy and evaluation framework.

Contribution

The paper introduces a five-level taxonomy for visual generation, analyzes key technical drivers, and proposes a comprehensive evaluation approach that emphasizes structural and causal capabilities.

Findings

01. Current evaluations overemphasize perceptual quality.

02. Structural, temporal, and causal failures are often overlooked.

03. The proposed roadmap guides future development of intelligent visual generators.

Abstract

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic…

Code & Models

Repositories

evolvinglmms-lab/Evolving-Visual-Generation (GitHub)
