Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

Shangwen Zhu; Qianyu Peng; Zhao Pu; Zhilei Shu; Xiangrui Ke; Zhaohu Xing; Zizhao Tong; Zeqing Wang; Xinyu Cui; Huangji Wang; Jian Zhao; Yeying Jin; Fan Cheng; Ruili Feng

arXiv:2605.18601·cs.CV·May 19, 2026

TL;DR

Incantation uses natural language as the action interface for multi-entity video world models, enabling fine-grained per-entity control and cross-entity generalization that fixed action interfaces (animation IDs, device inputs, scene-level captions) do not provide.

Contribution

The authors present Incantation as the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning, supporting simultaneous multi-entity control and concept-level cross-entity transfer.

Findings

01. Surpasses the Action-Index baseline on cross-entity transfer (89% vs. 43%).

02. Achieves 90% accuracy on out-of-vocabulary prompts.

03. Maintains 19.7 FPS at 480p with stable FVD over 2-hour rollouts.

Abstract

Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing…
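
To make the idea of per-latent-frame conditioning with frame-local text cross-attention more concrete, here is a minimal illustrative sketch in plain Python. It is not taken from the paper or its repository. The `TimedCommand` type, the `"entity: text"` prompt format, the entity names, and both helper functions are hypothetical, and the only value drawn from the abstract is the 0.25 s latent-frame duration. The sketch assumes that each latent frame receives its own prompt and that video tokens may attend only to text tokens belonging to the same latent frame.

```python
import math
from dataclasses import dataclass

# Duration of one latent frame, as stated in the abstract.
LATENT_FRAME_SECONDS = 0.25


@dataclass(frozen=True)
class TimedCommand:
    """Hypothetical text command for one entity, active over [start, end) seconds."""
    entity: str
    text: str
    start: float
    end: float


def prompts_per_latent_frame(commands, total_seconds):
    """Return one prompt string per latent frame (assumed format: 'entity: text')."""
    num_frames = math.ceil(total_seconds / LATENT_FRAME_SECONDS)
    prompts = []
    for i in range(num_frames):
        frame_start = i * LATENT_FRAME_SECONDS
        frame_end = frame_start + LATENT_FRAME_SECONDS
        # A command is active in this frame if its interval overlaps the frame.
        active = [c for c in commands if c.start < frame_end and c.end > frame_start]
        prompts.append("; ".join(f"{c.entity}: {c.text}" for c in active))
    return prompts


def frame_local_mask(video_tokens_per_frame, text_tokens_per_frame):
    """Boolean mask[q][k]: True when video token q and text token k share a latent frame."""
    q_frames = [i for i, n in enumerate(video_tokens_per_frame) for _ in range(n)]
    k_frames = [i for i, n in enumerate(text_tokens_per_frame) for _ in range(n)]
    return [[qf == kf for kf in k_frames] for qf in q_frames]


# Two hypothetical entities controlled simultaneously with overlapping commands.
commands = [
    TimedCommand("knight", "raise the shield", 0.0, 0.5),
    TimedCommand("dragon", "breathe fire", 0.25, 0.75),
]
prompts = prompts_per_latent_frame(commands, total_seconds=0.75)

# Two latent frames: 2 video tokens each; 1 and 3 text tokens respectively.
mask = frame_local_mask([2, 2], [1, 3])
```

In this toy setup the middle latent frame carries both commands, which is the sense in which several entities can be addressed at once, and the mask is block-diagonal, so a prompt issued for one latent frame cannot directly influence the tokens of another. How the actual model tokenizes prompts, encodes text, or combines this mask with its backbone is not specified in the abstract.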

Code & Models

Repositories

zhushangwen/Incantation (GitHub)
