Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models
Shangwen Zhu, Qianyu Peng, Zhao Pu, Zhilei Shu, Xiangrui Ke, Zhaohu Xing, Zizhao Tong, Zeqing Wang, Xinyu Cui, Huangji Wang, Jian Zhao, Yeying Jin, Fan Cheng, Ruili Feng

TL;DR
Incantation introduces a natural language interface for multi-entity video world models, enabling fine-grained control and cross-entity generalization beyond traditional fixed pipelines.
Contribution
It is the first interactive video world model with per-latent-frame natural-language conditioning supporting multi-entity control and concept-level transfer.
Findings
Surpasses the Action-Index baseline on cross-entity transfer (89% vs. 43%)
Achieves 90% accuracy on out-of-vocabulary prompts
Maintains 19.7 FPS at 480p with stable FVD over 2-hour rollouts
Abstract
Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing…
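The abstract does not spell out how frame-local text cross-attention is implemented, so the following is only a minimal sketch of one plausible reading: video tokens belonging to latent frame t attend only to the text tokens of the prompt issued for frame t. The function names (frame_local_mask, masked_cross_attention), the token counts, and the pure-Python attention loop are all illustrative assumptions, not the authors' code.

```python
import math

def frame_local_mask(num_frames, tokens_per_frame, text_len):
    """Hypothetical helper: mask[i][j] is True when video token i and
    text token j belong to the same latent frame."""
    q_frame = [f for f in range(num_frames) for _ in range(tokens_per_frame)]
    k_frame = [f for f in range(num_frames) for _ in range(text_len)]
    return [[qf == kf for kf in k_frame] for qf in q_frame]

def masked_cross_attention(q, k, v, mask):
    """Scaled dot-product cross-attention restricted by a boolean mask.

    q: video-token queries, k/v: text-token keys and values, all given as
    lists of float vectors. Assumes every query has at least one allowed key.
    """
    dim = len(q[0])
    outputs, weights = [], []
    for q_i, row_mask in zip(q, mask):
        # Disallowed text tokens get a score of -inf, so their weight is 0.
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(dim)
            if allowed else -math.inf
            for k_j, allowed in zip(k, row_mask)
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        outputs.append(
            [sum(w_j * v_j[d] for w_j, v_j in zip(w, v)) for d in range(len(v[0]))]
        )
    return outputs, weights

# Toy example: 2 latent frames, 2 video tokens each, 3 text tokens per prompt.
mask = frame_local_mask(num_frames=2, tokens_per_frame=2, text_len=3)
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
k = [[float(i), 1.0] for i in range(6)]
v = [[float(i), -float(i)] for i in range(6)]
out, attn = masked_cross_attention(q, k, v, mask)
```

In this toy setup, each row of attn places zero weight on text tokens from other latent frames, which is the property that would let a new instruction take effect at a single 0.25 s latent frame without rewriting the conditioning of earlier frames.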
