EasyV2V: A High-quality Instruction-based Video Editing Framework
Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei

TL;DR
EasyV2V is a novel, flexible instruction-based video editing framework that leverages pretrained models, diverse data synthesis, and unified control mechanisms to achieve state-of-the-art results in video editing tasks.
Contribution
It introduces a simple yet effective design for instruction-based video editing, combining data augmentation, model simplification, and unified control to outperform existing systems.
Findings
Achieves state-of-the-art video editing results.
Supports flexible input modalities including text, masks, and references.
Outperforms concurrent and commercial video editing systems.
Abstract
While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce \emph{EasyV2V}, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications
