VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

Xiaoyan Cong; Haotian Yang; Angtian Wang; Yizhi Wang; Yiding Yang; Canyu Zhang; Chongyang Ma

arXiv:2512.16906 · cs.CV · December 19, 2025

Open Access

TL;DR

VIVA is a scalable framework for instruction-based video editing that uses a vision-language model (VLM) to encode instructions and a reward-optimization post-training stage to improve generalization and edit quality.

Contribution

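The paper contributes a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, together with a reward-optimization post-training stage, Edit-GRPO, aimed at more flexible and higher-quality instruction-based video editing.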

Findings

01

VIVA outperforms state-of-the-art methods in instruction following.

02

The framework generalizes well to complex, real-world instructions.

03

It produces content-preserving and aesthetically pleasing edits.

Abstract

Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which…
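The abstract is cut off before it describes Edit-GRPO, so the paper's actual reward design and objective are not shown here. As a rough orientation only, the sketch below illustrates the group-relative advantage step used in generic GRPO-style training: several candidate outputs are sampled for the same input, each is scored by a reward function, and the scores are normalized within that group. The function name, the `eps` parameter, and the reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of candidates for the same input.

    Generic GRPO-style step (an assumption, not the paper's Edit-GRPO):
    advantage_i = (reward_i - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four candidate edits of one
# (source video, instruction) pair.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates that score above the group mean receive positive advantages and those below receive negative ones, so the update signal depends on relative quality within the group rather than on the absolute reward scale.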

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications