Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

Xinyu Liu; Hangjie Yuan; Yujie Wei; Jiazheng Xing; Yujin Han; Jiahao Pan; Yanbiao Ma; Chi-Min Chan; Kang Zhao; Shiwei Zhang; Wenhan Luo; Yike Guo

arXiv:2512.09924 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper introduces a new reasoning-aware video editing task and benchmark, and proposes ReViSE, a self-reflective learning framework that improves video editing by leveraging internal vision-language models for evaluation and refinement.

Contribution

The paper presents the RVE task and RVE-Bench benchmark, and introduces ReViSE, a novel self-reflective learning method using internal VLMs for improved reasoning in video editing.

Findings

01

ReViSE outperforms finetuned models by 10% on the RAVE subset.

02

The benchmark covers diverse reasoning dimensions in editing and generation.

03

Self-reflective learning enhances editing accuracy and visual fidelity.

Abstract

Unified video models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: (1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and (2) an inherent disconnect between the models' reasoning and editing capabilities, which prevents understanding from guiding the editing process. To address this, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Aware Video Editing (RAVE) and In-Context Video-to-Video Generation (ICVG), spanning diverse reasoning dimensions…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection