Video4Edit: Viewing Image Editing as a Degenerate Temporal Process

Xiaofan Li; Yanpeng Sun; Chenming Wu; Fan Duan; YuAn Wang; Weihao Bo; Yumeng Zhang; Dingkang Liang

arXiv:2511.18131·cs.CV·November 25, 2025

TL;DR

Video4Edit models image editing as a degenerate temporal process and transfers priors from video pre-training, enabling data-efficient fine-tuning that matches leading baselines with about 1% of the supervision used by mainstream models.

Contribution

The paper proposes a temporal modeling approach to image editing that reduces data requirements by transferring priors from video pre-training.

Findings

01 Matches performance of leading baselines

02 Uses only about 1% of the supervision needed by mainstream models

03 Achieves high-quality editing with efficient fine-tuning

Abstract

We observe that recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive high-quality triplets of {instruction, source image, edited image} to cover diverse user intents. Moreover, the fidelity of visual replacements hinges on how precisely the instruction references the target semantics. We revisit this challenge through the lens of temporal modeling: if video can be regarded as a full temporal process, then image editing can be seen as a degenerate temporal process. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime. Empirically, our approach matches…
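One way to read the "degenerate temporal process" framing is that an editing triplet can be packed as a two-frame clip, with the source image as the first frame and the edited image as the second. The sketch below illustrates only that data layout. It is an assumption made for illustration, not the authors' implementation: the names `Clip` and `edit_pair_as_clip` are hypothetical, and images are toy nested lists rather than real tensors.

```python
from dataclasses import dataclass
from typing import List

# Toy grayscale image: a list of rows, each row a list of pixel values.
Image = List[List[float]]


@dataclass
class Clip:
    """Hypothetical container: ordered frames plus a text instruction."""
    frames: List[Image]  # index in this list plays the role of time
    instruction: str     # text conditioning for the edit


def edit_pair_as_clip(source: Image, edited: Image, instruction: str) -> Clip:
    """Pack an {instruction, source image, edited image} triplet as a two-frame clip.

    Assumption for illustration: frame 0 is the source, frame 1 is the edit.
    """
    same_rows = len(source) == len(edited)
    same_cols = all(len(a) == len(b) for a, b in zip(source, edited))
    if not (same_rows and same_cols):
        raise ValueError("source and edited images must share the same shape")
    return Clip(frames=[source, edited], instruction=instruction)


source = [[0.0, 0.0], [1.0, 1.0]]
edited = [[0.0, 1.0], [1.0, 1.0]]
clip = edit_pair_as_clip(source, edited, "brighten the top-right pixel")
num_frames = len(clip.frames)
```

Under this reading, a video model that already predicts how a frame evolves over time would see an edit as a clip whose temporal axis has collapsed to a single step.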

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection