VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation
Jinxiang Lai, Zexin Lu, Jiajun He, Rongwei Quan, Wenzhe Zhao, Qinyu Yang, Qi Chen, Qin Lin, Chuyue Li, Tao Gao, Yuhao Shan, Shuai Shao, Song Guo, Qinglin Lu

TL;DR
VisionCreator is a unified, end-to-end visual-generation model that integrates understanding, thinking, planning, and creation capabilities, enabling autonomous complex visual content creation.
Contribution
It introduces a novel framework with specialized data, training methods, and benchmarks for developing agentic models with multi-step visual creation skills.
Findings
VisionCreator-8B/32B outperform larger closed-source models.
The metacognition-based VisionAgent generates high-quality creation trajectories with explicit UTPC structures.
Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) train the model effectively.
Abstract
Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows, capabilities that are challenging for general models, while workflow-based agents lack specialized knowledge for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology, which uses the metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) the VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition…
Taxonomy
Topics: Artificial Intelligence in Games · Multimodal Machine Learning Applications · Design Education and Practice
