ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

Jiayang Xu; Fan Zhuo; Majun Zhang; Changhao Pan; Zehan Wang; Siyu Chen; Xiaoda Yang; Tao Jin; Zhou Zhao

arXiv:2604.07958 · cs.CV · April 24, 2026


TL;DR

ImVideoEdit is a video editing framework trained entirely on image pairs. It keeps the pretrained model's temporal dynamics intact by freezing the 3D attention modules, and learns accurate, text-guided spatial edits at low computational cost.

Contribution

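The paper treats video editing as a decoupled spatiotemporal process: spatial editing is learned from image pairs through a Predict-Update Spatial Difference Attention module, while temporal behavior is inherited from the frozen pretrained model.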

Findings

01 Achieves editing fidelity comparable to larger models trained on extensive video data.

02 Requires only 13K image pairs and 5 epochs for training with low computational overhead.

03 Maintains temporal consistency while enabling precise spatial edits.

Abstract

Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating…
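The two mechanical ideas in the abstract, treating an image as a single-frame video and freezing the pretrained temporal modules so that only spatial parameters are trained, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain nested lists instead of tensors, and the `Param` class, the parameter names (`attn3d`, `spatial_diff_attn`), and the channels-first `[C][T][H][W]` layout are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[List[float]]]        # assumed layout: [C][H][W]
Video = List[List[List[List[float]]]]  # assumed layout: [C][T][H][W]


def image_to_single_frame_video(image: Image) -> Video:
    """Insert a length-1 time axis: [C][H][W] -> [C][1][H][W]."""
    return [[channel] for channel in image]


@dataclass
class Param:
    """Hypothetical stand-in for a named model parameter."""
    name: str
    trainable: bool = True


def freeze_matching(params: List[Param], keyword: str) -> List[Param]:
    """Freeze every parameter whose name contains `keyword`.

    Returns the parameters that remain trainable.
    """
    for p in params:
        if keyword in p.name:
            p.trainable = False
    return [p for p in params if p.trainable]


# Hypothetical parameter names; the real model's naming is not given in the source.
params = [
    Param("blocks.0.attn3d.qkv"),
    Param("blocks.0.spatial_diff_attn.qkv"),
    Param("blocks.1.attn3d.qkv"),
    Param("blocks.1.spatial_diff_attn.qkv"),
]
trainable = freeze_matching(params, "attn3d")

# A 3-channel, 2x2 image lifted to a video with exactly one frame.
image = [[[0.0, 0.1], [0.2, 0.3]] for _ in range(3)]
video = image_to_single_frame_video(image)
```

In a tensor framework the same two steps would typically be an added time dimension on the input batch and a disabled gradient flag on the temporal attention weights; the list-based version above only shows the bookkeeping.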
