Video-As-Prompt: Unified Semantic Control for Video Generation

Yuxuan Bian; Xin Chen; Zenan Li; Tiancheng Zhi; Shen Sang; Linjie Luo; Qiang Xu

arXiv:2510.20888·cs.CV·October 27, 2025

TL;DR

This paper introduces Video-As-Prompt (VAP), a novel approach for semantic control in video generation that uses reference videos as prompts, achieving state-of-the-art results and strong zero-shot generalization.

Contribution

VAP reframes semantic control as in-context generation using a frozen transformer and a large new dataset, enabling robust, unified, and zero-shot controllable video synthesis.

Findings

01

VAP achieves a 38.7% user preference rate, rivaling commercial models.

02

VAP demonstrates strong zero-shot generalization across semantic conditions.

03

VAP sets a new state-of-the-art for open-source semantic-controlled video generation.

Abstract

Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with…
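The temporally biased position embedding mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `(frame, row, col)` index layout, and the choice of shifting the target's frame indices past the reference's are all assumptions made for illustration. The abstract gives neither the size nor the direction of the bias. The sketch shows only the general idea that reference and target tokens keep the same spatial coordinates but never share a temporal index, so no reference token sits at exactly the same position as a target token.

```python
from typing import List, Tuple

# Hypothetical index layout: (frame, row, col) per video token.
Position = Tuple[int, int, int]


def grid_positions(frames: int, height: int, width: int,
                   t_offset: int = 0) -> List[Position]:
    """Enumerate token positions for a video, shifting frame indices by t_offset."""
    return [
        (t + t_offset, y, x)
        for t in range(frames)
        for y in range(height)
        for x in range(width)
    ]


def biased_positions(ref_frames: int, tgt_frames: int,
                     height: int, width: int
                     ) -> Tuple[List[Position], List[Position]]:
    """Assign positions to reference and target tokens without temporal overlap.

    Assumption (not taken from the paper): the reference occupies frames
    0..ref_frames-1 and the target is shifted to start at ref_frames.
    Spatial indices are left unchanged for both videos.
    """
    ref = grid_positions(ref_frames, height, width, t_offset=0)
    tgt = grid_positions(tgt_frames, height, width, t_offset=ref_frames)
    return ref, tgt


if __name__ == "__main__":
    ref, tgt = biased_positions(ref_frames=2, tgt_frames=3, height=2, width=2)
    # Without a bias, both videos would start at frame 0 and share positions.
    unbiased_tgt = grid_positions(3, 2, 2)
    print("shared positions without bias:", len(set(ref) & set(unbiased_tgt)))
    print("shared positions with bias:", len(set(ref) & set(tgt)))
```

With identical indices, each reference token would line up with a target token at the same place and time, which is the kind of pixel-aligned correspondence the reviews describe the temporal bias as avoiding.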

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

Unified and Generalizable Model: VAP is, to the authors' knowledge, the first framework to successfully unify a diverse set of semantic controls into a single model without requiring per-task modules or finetuning.
Strong Zero-Shot Performance: The model shows a strong ability to generalize to semantic conditions that were not included in the VAP-Data, such as "crumble" and "dissolve". This indicates it is learning a generalizable concept of semantic transfer.
VAP-Data: The paper introduces th

Weaknesses

The paper is strong and transparently discusses its own limitations.
Synthetic Data Limitations: The primary weakness is that VAP-Data is entirely synthetic, generated using other models. The paper notes this means VAP may inherit the "stylistic biases, artifacts, and conceptual limitations" (e.g., bad hands) of these source models.
Dependence on Caption Quality: Performance relies on well-aligned semantic descriptions in the reference and target captions. The authors show that mislabeled caption

Reviewer 02 · Rating 6 · Confidence 3

Strengths

+ The paper introduces a clean and unified perspective by treating a reference video as a semantic prompt, avoiding the fragmented control paradigms (e.g., pose/depth-specific pipelines) used in prior work.
+ The MoT structure and temporally-biased RoPE are well-motivated and demonstrated to be effective in preventing forgetting and cross-token interference; ablations support the design choices.
+ Strong qualitative performance across multiple semantic axes (concept, style, motion, camera inte

Weaknesses

- The training data is largely synthetic and template-driven, which may limit generalization to real-world video distributions; robustness to natural, diverse videos is not extensively evaluated.
- The MoT architecture increases compute cost and memory footprint, making the method relatively heavy compared to lightweight or plug-in control modules.
- The method assumes reasonably descriptive captions for reference and target videos; behavior under noisy or under-specified captions remains insu

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper proposes a unified "video-as-prompt" paradigm that reframes semantic control in video generation as an in-context learning problem, offering a clear conceptual advance over task-specific approaches.
2. The temporally biased RoPE effectively avoids pixel-level copying between reference and target videos, leading to more robust semantic alignment.
3. The Mixture-of-Transformers design enables plug-and-play integration with existing video diffusion transformers while preventing catastr

Weaknesses

Major:
1. The paper lacks a theoretical analysis explaining why in-context learning via Mixture-of-Transformers effectively transfers semantic patterns.
2. The proposed temporally biased RoPE is only justified empirically, without an ablation or analytical study on the optimal bias magnitude.
3. The inference cost roughly doubles due to the dual-transformer structure, yet efficiency and scalability trade-offs are not thoroughly studied.
Minor:
1. The semantic diversity in VAP-Data is constraine

Code & Models

Models

Datasets

BianYx/VAP-Data
dataset · 69k downloads
