Plan-X: Instruct Video Generation via Semantic Planning

Lun Huang; You Xie; Hongyi Xu; Tianpei Gu; Chenxu Zhang; Guoxian Song; Zenan Li; Xiaochen Zhao; Linjie Luo; Guillermo Sapiro

arXiv:2511.17986 · cs.CV · November 25, 2025

Open Access

TL;DR

Plan-X introduces a semantic planning framework that improves instruction-aligned video generation by reducing hallucinations and enhancing high-level reasoning in diffusion-based models.

Contribution

It proposes a learnable semantic planner that generates structured semantic tokens to guide video synthesis, addressing limitations of existing diffusion transformers.

Findings

01. Significantly reduces visual hallucinations in generated videos.

02. Enables fine-grained, instruction-aligned video synthesis.

03. Improves handling of complex scene understanding and multi-stage actions.

Abstract

Diffusion Transformers have demonstrated remarkable capabilities in visual synthesis, yet they often struggle with high-level semantic reasoning and long-horizon planning. This limitation frequently leads to visual hallucinations and misalignments with user instructions, especially in scenarios involving complex scene understanding, human-object interactions, multi-stage actions, and in-context motion reasoning. To address these challenges, we propose Plan-X, a framework that explicitly enforces high-level semantic planning to instruct the video generation process. At its core lies a Semantic Planner, a learnable multimodal language model that reasons over the user's intent from both text prompts and visual context, and autoregressively generates a sequence of text-grounded spatio-temporal semantic tokens. These semantic tokens, complementary to high-level text prompt guidance, serve as…
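
As a rough illustration of the planning step the abstract describes, the sketch below shows an autoregressive loop that emits a sequence of semantic tokens conditioned on a text prompt and visual context, then packages them alongside the text prompt as conditioning. It is not the paper's implementation: the function names, the token vocabulary, the end-of-plan token, and the arithmetic stand-in for the learned planner are all assumptions made purely for illustration.

```python
from typing import Dict, List, Sequence

VOCAB_SIZE = 16  # hypothetical semantic-token vocabulary size
END = 0          # hypothetical end-of-plan token id


def toy_next_token(history: Sequence[int]) -> int:
    """Stand-in for a learned planner (assumption, not the paper's model).

    A real planner would be a multimodal language model; here a fixed
    arithmetic rule keeps the example deterministic and runnable.
    """
    return (sum(history) * 31 + len(history)) % VOCAB_SIZE


def plan_semantic_tokens(
    prompt_ids: Sequence[int],
    visual_ids: Sequence[int],
    max_len: int = 8,
) -> List[int]:
    """Autoregressively emit semantic tokens given text and visual context."""
    context = list(prompt_ids) + list(visual_ids)
    plan: List[int] = []
    for _ in range(max_len):
        # Each new token depends on the context plus all tokens emitted so far.
        token = toy_next_token(context + plan)
        if token == END:
            break
        plan.append(token)
    return plan


def build_conditioning(
    prompt_ids: Sequence[int], plan: Sequence[int]
) -> Dict[str, List[int]]:
    """Keep the semantic plan alongside the text prompt (hypothetical layout)."""
    return {"text": list(prompt_ids), "semantic_plan": list(plan)}


if __name__ == "__main__":
    tokens = plan_semantic_tokens(prompt_ids=[3, 5], visual_ids=[7])
    print(build_conditioning([3, 5], tokens))
```

The loop only captures the autoregressive structure: each semantic token is chosen from the prompt, the visual context, and the tokens already produced. How the video generator consumes the resulting plan is left out, since the truncated abstract does not specify it.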

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation