In-Video Instructions: Visual Signals as Generative Control

Gongfan Fang; Xinyin Ma; Xinchao Wang

arXiv:2511.19401·cs.CV·November 25, 2025

Open Access

TL;DR

This paper introduces In-Video Instruction, a method for controlling image-to-video generation by embedding visual signals such as overlaid text and arrows directly into the frames, which gives users explicit, spatially aware control over individual objects.

Contribution

The paper proposes a control paradigm for video generation based on visual signals, in which explicit, spatially aware instructions are embedded directly in the frames.

Findings

01. Video models reliably interpret embedded visual instructions

02. Effective control demonstrated in complex multi-object scenarios

03. Compatible with multiple state-of-the-art generators

Abstract

Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three…
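To make the idea of encoding guidance "directly into the visual domain" concrete, the following is a minimal sketch of drawing an arrow onto a conditioning frame before it would be handed to an image-to-video model. It is not the authors' pipeline: the frame representation (a nested list of RGB tuples), the function names `blank_frame`, `draw_line`, and `overlay_arrow`, and all parameter defaults are assumptions made for illustration. Overlaid text would additionally need a font rasterizer from an imaging library, and the call to a video generator is omitted.

```python
import math

Color = tuple[int, int, int]
Frame = list[list[Color]]  # frame[y][x]; a hypothetical in-memory RGB image


def blank_frame(width: int, height: int, color: Color = (255, 255, 255)) -> Frame:
    """Create a solid-color frame standing in for a real first frame."""
    return [[color for _ in range(width)] for _ in range(height)]


def draw_line(frame: Frame, x0: int, y0: int, x1: int, y1: int, color: Color) -> None:
    """Rasterize a line in place with Bresenham's algorithm, clipping to the frame."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        if 0 <= y0 < len(frame) and 0 <= x0 < len(frame[0]):
            frame[y0][x0] = color
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            if x0 == x1:
                break
            err += dy
            x0 += sx
        if e2 <= dx:
            if y0 == y1:
                break
            err += dx
            y0 += sy


def overlay_arrow(
    frame: Frame,
    start: tuple[int, int],
    end: tuple[int, int],
    color: Color = (255, 0, 0),
    head_len: int = 6,
    head_angle_deg: float = 30.0,
) -> Frame:
    """Return a copy of `frame` with an arrow from `start` to `end` drawn on it."""
    out = [row[:] for row in frame]  # leave the caller's frame untouched
    (x0, y0), (x1, y1) = start, end
    draw_line(out, x0, y0, x1, y1, color)  # arrow shaft
    theta = math.atan2(y1 - y0, x1 - x0)
    spread = math.radians(head_angle_deg)
    for sign in (1, -1):  # two strokes forming the arrow head
        hx = round(x1 - head_len * math.cos(theta + sign * spread))
        hy = round(y1 - head_len * math.sin(theta + sign * spread))
        draw_line(out, x1, y1, hx, hy, color)
    return out


# Usage: mark a left-to-right motion cue across the middle of the frame.
first_frame = blank_frame(64, 48)
conditioned = overlay_arrow(first_frame, start=(10, 24), end=(50, 24))
```

Because each arrow is anchored at a pixel location, several arrows with different start points could be drawn on the same frame to address different objects, which is the kind of per-object correspondence the abstract contrasts with a single global text prompt.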

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Pose and Action Recognition