Video-As-Prompt: Unified Semantic Control for Video Generation
Yuxuan Bian, Xin Chen, Zenan Li, Tiancheng Zhi, Shen Sang, Linjie Luo, Qiang Xu

TL;DR
This paper introduces Video-As-Prompt (VAP), an approach to semantic control in video generation that uses a reference video as the prompt, achieving state-of-the-art results among open-source methods and strong zero-shot generalization.
Contribution
VAP reframes semantic control as in-context generation: a reference video guides a frozen Video Diffusion Transformer through a plug-and-play Mixture-of-Transformers expert, trained on the new VAP-Data dataset, enabling unified, zero-shot controllable video synthesis.
Findings
VAP achieves a 38.7% user preference rate, rivaling commercial models.
VAP demonstrates strong zero-shot generalization across semantic conditions.
VAP sets a new state-of-the-art for open-source semantic-controlled video generation.
Abstract
Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with…
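The temporally biased position embedding mentioned in the abstract can be illustrated with a toy index assignment. This is a minimal sketch, not the authors' implementation: the function names, the `offset` parameter, and the choice of shifting the reference clip to negative temporal indices are all assumptions made for illustration. It shows why sharing temporal indices between the reference and target clips gives both the same rotary phases (a frame-to-frame mapping prior), and how a temporal shift removes that overlap.

```python
import numpy as np

def temporal_positions(num_ref_frames, num_target_frames, offset=0):
    """Assign temporal position indices to reference and target frames.

    offset=0 is the naive scheme: both clips start at index 0.
    offset>=num_ref_frames (hypothetical parameter) shifts the whole
    reference clip before the target clip so no index is shared.
    """
    ref_idx = np.arange(num_ref_frames) - offset
    target_idx = np.arange(num_target_frames)
    return ref_idx, target_idx

def rope_angles(positions, dim=8, base=10000.0):
    """Rotary-embedding angles for each position (standard RoPE frequencies)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

# Naive scheme: reference frame i and target frame i get identical angles.
ref_naive, tgt = temporal_positions(4, 6)

# Biased scheme (assumed form): reference indices become [-4, -3, -2, -1].
ref_biased, _ = temporal_positions(4, 6, offset=4)

shared_naive = np.intersect1d(ref_naive, tgt).size
shared_biased = np.intersect1d(ref_biased, tgt).size
```

In this toy setup the naive scheme shares every reference index with the target, while the shifted scheme shares none, so attention can no longer align reference frame i with target frame i through position alone.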
Peer Reviews
Decision: ICLR 2026 Poster
Unified and Generalizable Model: VAP is, to the authors' knowledge, the first framework to successfully unify a diverse set of semantic controls into a single model without requiring per-task modules or finetuning.
Strong Zero-Shot Performance: The model shows a strong ability to generalize to semantic conditions that were not included in the VAP-Data, such as "crumble" and "dissolve". This indicates it is learning a generalizable concept of semantic transfer.
VAP-Data: The paper introduces th
…the paper is strong and transparently discusses its own limitations.
Synthetic Data Limitations: The primary weakness is that VAP-Data is entirely synthetic, generated using other models. The paper notes this means VAP may inherit the "stylistic biases, artifacts, and conceptual limitations" (e.g., bad hands) of these source models.
Dependence on Caption Quality: Performance relies on well-aligned semantic descriptions in the reference and target captions. The authors show that mislabeled caption
+ The paper introduces a clean and unified perspective by treating a reference video as a semantic prompt, avoiding the fragmented control paradigms (e.g., pose/depth-specific pipelines) used in prior work.
+ The MoT structure and temporally-biased RoPE are well-motivated and demonstrated to be effective in preventing forgetting and cross-token interference; ablations support the design choices.
+ Strong qualitative performance across multiple semantic axes (concept, style, motion, camera inte
- The training data is largely synthetic and template-driven, which may limit generalization to real-world video distributions; robustness to natural, diverse videos is not extensively evaluated.
- The MoT architecture increases compute cost and memory footprint, making the method relatively heavy compared to lightweight or plug-in control modules.
- The method assumes reasonably descriptive captions for reference and target videos; behavior under noisy or under-specified captions remains insu
1. The paper proposes a unified “video-as-prompt” paradigm that reframes semantic control in video generation as an in-context learning problem, offering a clear conceptual advance over task-specific approaches.
2. The temporally biased RoPE effectively avoids pixel-level copying between reference and target videos, leading to more robust semantic alignment.
3. The Mixture-of-Transformers design enables plug-and-play integration with existing video diffusion transformers while preventing catastr
Major:
1. The paper lacks a theoretical analysis explaining why in-context learning via Mixture-of-Transformers effectively transfers semantic patterns.
2. The proposed temporally biased RoPE is only justified empirically, without an ablation or analytical study on the optimal bias magnitude.
3. The inference cost roughly doubles due to the dual-transformer structure, yet efficiency and scalability trade-offs are not thoroughly studied.
Minor:
1. The semantic diversity in VAP-Data is constraine
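The reviews attribute the higher inference cost to the dual-transformer structure. A minimal sketch of that structure follows, assuming a single attention layer, one set of Q/K/V weights per branch, and NumPy in place of a real DiT; the names `make_branch`, `mot_attention`, and the width `D` are hypothetical and not taken from the paper. It shows the general Mixture-of-Transformers pattern: reference tokens pass through a trainable expert, target tokens pass through the frozen backbone, and both attend over the concatenated sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical hidden width

def make_branch(d):
    """One set of Q/K/V weights, standing in for a full transformer block."""
    return {n: rng.standard_normal((d, d)) / np.sqrt(d) for n in ("q", "k", "v")}

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mot_attention(ref_tokens, tgt_tokens, expert, backbone):
    """Project each stream with its own weights, then attend jointly."""
    q = np.concatenate([ref_tokens @ expert["q"], tgt_tokens @ backbone["q"]])
    k = np.concatenate([ref_tokens @ expert["k"], tgt_tokens @ backbone["k"]])
    v = np.concatenate([ref_tokens @ expert["v"], tgt_tokens @ backbone["v"]])
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    n_ref = ref_tokens.shape[0]
    return out[:n_ref], out[n_ref:]

def count_params(branch):
    return sum(w.size for w in branch.values())

backbone = make_branch(D)  # frozen in this sketch: never updated
expert = make_branch(D)    # trainable branch that reads the reference video

ref = rng.standard_normal((6, D))  # reference-video tokens
tgt = rng.standard_normal((6, D))  # target-video tokens
ref_out, tgt_out = mot_attention(ref, tgt, expert, backbone)

total_params = count_params(backbone) + count_params(expert)
```

In this toy layer the active parameter count is twice that of the backbone alone and the attention sequence is the concatenation of both streams, which is consistent with the reviewers' remark about cost; the actual overhead of the full model depends on details not covered here.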
