In-Video Instructions: Visual Signals as Generative Control
Gongfan Fang, Xinyin Ma, Xinchao Wang

TL;DR
This paper introduces In-Video Instruction, a method for controlling video generation by embedding visual signals like text and arrows directly into frames, enabling explicit and spatially-aware user guidance.
Contribution
The paper proposes a control paradigm for video generation based on visual signals: explicit, spatially aware instructions are embedded directly in the frames.
Findings
Video models reliably interpret embedded visual instructions
Effective control demonstrated in complex multi-object scenarios
Compatible with multiple state-of-the-art generators
Abstract
Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three…
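To make the idea of encoding guidance "directly into the visual domain" concrete, the following is a minimal sketch of how one might draw per-object arrows onto a conditioning frame before passing it to an image-to-video model. It is not the authors' implementation. The frame representation (a nested list of RGB tuples), the function names `blank_frame`, `draw_line`, and `draw_arrow`, and the coordinates and colors are all assumptions made for illustration. Overlaid text is omitted because it needs a font rasterizer from an imaging library.

```python
import math

RED = (255, 0, 0)
BLUE = (0, 0, 255)
WHITE = (255, 255, 255)


def blank_frame(width, height, color=WHITE):
    """Create a frame as rows of RGB tuples (hypothetical stand-in for a real image)."""
    return [[color for _ in range(width)] for _ in range(height)]


def draw_line(frame, start, end, color):
    """Rasterize a line with Bresenham's algorithm, skipping out-of-bounds pixels."""
    x0, y0 = start
    x1, y1 = end
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        if 0 <= y0 < len(frame) and 0 <= x0 < len(frame[0]):
            frame[y0][x0] = color
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def draw_arrow(frame, start, end, color, head_len=6, head_angle_deg=30):
    """Draw a shaft from start to end plus two short head strokes at the end point."""
    draw_line(frame, start, end, color)
    angle = math.atan2(end[1] - start[1], end[0] - start[0])
    for sign in (1, -1):
        # Each head stroke points backwards along the shaft, rotated by the head angle.
        theta = angle + math.pi + sign * math.radians(head_angle_deg)
        tip = (
            round(end[0] + head_len * math.cos(theta)),
            round(end[1] + head_len * math.sin(theta)),
        )
        draw_line(frame, end, tip, color)


# Assumed scenario: two objects in one frame, each given its own motion instruction.
frame = blank_frame(64, 48)
instructions = [
    ((10, 24), (50, 24), RED),   # hypothetical object A: move right
    ((20, 42), (20, 30), BLUE),  # hypothetical object B: move up
]
for start, end, color in instructions:
    draw_arrow(frame, start, end, color)
```

Because each arrow starts at a specific object's location, the instruction is tied to that object spatially rather than described globally in a text prompt, which is the distinction the abstract draws. How the annotated frame is then fed to a particular generator is outside the scope of this sketch.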
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Pose and Action Recognition
