The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Hanlin Wang; Hao Ouyang; Qiuyu Wang; Yue Yu; Yihao Meng; Wen Wang; Ka Leong Cheng; Shuailei Ma; Qingyan Bai; Yixuan Li; Cheng Chen; Yanhong Zeng; Xing Zhu; Yujun Shen; Qifeng Chen

arXiv:2512.16924 · cs.CV · December 19, 2025

Open Access

TL;DR

WorldCanvas is a multimodal framework that enables user-controlled, coherent, and visually grounded simulation of complex world events using text, trajectories, and reference images.

Contribution

WorldCanvas introduces a multimodal approach that combines trajectories, text, and reference images to generate controllable, coherent multi-agent scene videos, advancing interactive world modeling.

Findings

01

Generates temporally coherent, multi-agent videos with object identity preservation.

02

Supports complex events like object entry/exit and counterintuitive scenarios.

03

Demonstrates emergent scene consistency despite temporary object disappearance.

Abstract

We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to…
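
To make the three prompt modalities named in the abstract concrete, here is a minimal sketch of how one agent's prompt could be represented: a trajectory whose points carry position (motion), a timestamp (timing), and a visibility flag, paired with a text description and an optional reference image. Every name below (`TrajectoryPoint`, `AgentPrompt`, `visible_spans`, the file name) is a hypothetical illustration and not the paper's actual data format or API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TrajectoryPoint:
    """One sample along a user-drawn trajectory (hypothetical layout)."""
    t: float                 # timestamp in seconds (timing)
    xy: Tuple[float, float]  # normalized image coordinates (motion)
    visible: bool = True     # False while the object is out of view


@dataclass
class AgentPrompt:
    """Text + trajectory + optional reference image for one agent (assumed structure)."""
    text: str
    points: List[TrajectoryPoint]
    reference_image: Optional[str] = None  # hypothetical path to an identity reference

    def visible_spans(self) -> List[Tuple[float, float]]:
        """Return (start, end) timestamps of each contiguous visible run."""
        spans: List[Tuple[float, float]] = []
        start: Optional[float] = None
        prev_t: Optional[float] = None
        for p in sorted(self.points, key=lambda q: q.t):
            if p.visible and start is None:
                start = p.t                      # a visible run begins
            elif not p.visible and start is not None:
                spans.append((start, prev_t))    # run ended at the previous sample
                start = None
            prev_t = p.t
        if start is not None:
            spans.append((start, prev_t))        # close a run that lasts to the end
        return spans


# Example: an agent that leaves the frame at t=2 and re-enters at t=3.
dog = AgentPrompt(
    text="a dog runs behind the tree and comes back",
    points=[
        TrajectoryPoint(0.0, (0.1, 0.5)),
        TrajectoryPoint(1.0, (0.3, 0.5)),
        TrajectoryPoint(2.0, (0.5, 0.5), visible=False),
        TrajectoryPoint(3.0, (0.7, 0.5)),
        TrajectoryPoint(4.0, (0.9, 0.5)),
    ],
    reference_image="dog_reference.png",
)
print(dog.visible_spans())
```

Keeping visibility as a per-point flag, rather than splitting the path into separate segments, is one simple way to express entry and exit events while still treating the agent as a single identity across its disappearance.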

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications