VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

Jinxiang Lai; Zexin Lu; Jiajun He; Rongwei Quan; Wenzhe Zhao; Qinyu Yang; Qi Chen; Qin Lin; Chuyue Li; Tao Gao; Yuhao Shan; Shuai Shao; Song Guo; Qinglin Lu

arXiv:2603.02681 · cs.CV · March 4, 2026

Open Access

TL;DR

VisionCreator is a unified, end-to-end visual-generation model that integrates understanding, thinking, planning, and creation capabilities, enabling it to create complex visual content autonomously.

Contribution

The paper introduces a framework, with specialized data, training methods, and benchmarks, for developing agentic models with multi-step visual creation skills.

Findings

01. VisionCreator-8B/32B outperform larger closed-source models.

02. High-quality creation trajectories generated with explicit UTPC structures.

03. Effective training via Progressive Specialization and Virtual Reinforcement Learning.

Abstract

Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows, capabilities that are challenging for general models, while workflow-based agents lack specialized knowledge for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology, which uses a metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) the VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition…

Taxonomy

Topics: Artificial Intelligence in Games · Multimodal Machine Learning Applications · Design Education and Practice