VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma

TL;DR
VIVA is a scalable framework for instruction-based video editing that uses VLM-guided instruction encoding and reward optimization to improve generalization and edit quality.
Contribution
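The paper contributes a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually grounded instruction representations, together with a reward-optimization post-training stage, Edit-GRPO.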
Findings
VIVA outperforms state-of-the-art methods in instruction following.
The framework generalizes well to complex, real-world instructions.
It produces content-preserving and aesthetically pleasing edits.
Abstract
Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which…
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
