LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing
Weicheng Wang, Zhicheng Zhang, Zhongqi Zhang, Juncheng Zhou, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Jufeng Yang

TL;DR
LIVE is a joint training framework that combines large-scale image editing data with video editing data, using a frame-wise token noise strategy and a two-stage training process to improve instruction-based video editing.
Contribution
The paper proposes a joint training approach leveraging large-scale image editing data and a new token noise strategy to improve video editing capabilities.
Findings
Achieves state-of-the-art performance on a comprehensive video editing benchmark.
Mitigates the domain gap between static images and dynamic videos using frame-wise token noise.
Demonstrates the benefit of combining image and video data for training.
Abstract
Video editing aims to modify input videos according to user intent. Recently, end-to-end training methods have garnered widespread attention, constructing paired video editing data through video generation or editing models. However, compared to image editing, the high annotation costs of video data severely constrain the scale, quality, and task diversity of video editing datasets when relying on video generative models or manual annotation. To bridge this gap, we propose LIVE, a joint training framework that leverages large-scale, high-quality image editing data alongside video datasets to bolster editing capabilities. To mitigate the domain discrepancy between static images and dynamic videos, we introduce a frame-wise token noise strategy, which treats the latents of specific frames as reasoning tokens, leveraging large pretrained video generative models to create plausible temporal…
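The abstract is cut off before it explains how the frame-wise token noise is applied, so the following is only a rough illustration of the general idea, not the authors' method. It assumes that an image latent can be repeated into a static clip and that noise is then mixed into selected frames only, leaving the remaining frames clean. The function names `image_to_pseudo_video` and `framewise_noise`, the flat-list latent representation, and the linear mixing rule are all hypothetical choices made for this sketch.

```python
import random


def image_to_pseudo_video(image_latent, num_frames):
    """Repeat one image latent (a flat list of floats) into a static clip.

    Hypothetical helper: the source does not say how images are
    lifted to the video domain.
    """
    return [list(image_latent) for _ in range(num_frames)]


def framewise_noise(latents, noisy_frames, sigma, seed=0):
    """Mix Gaussian noise into selected frames only.

    latents      : list of frames, each a flat list of floats
    noisy_frames : indices of the frames to perturb
    sigma        : mixing weight in [0, 1]; 0 keeps a frame clean
    The linear mix (1 - sigma) * x + sigma * noise is an assumption,
    not a rule taken from the paper.
    """
    rng = random.Random(seed)
    selected = set(noisy_frames)
    out = []
    for t, frame in enumerate(latents):
        # Per-frame noise level: sigma on selected frames, 0 elsewhere.
        level = sigma if t in selected else 0.0
        out.append([(1.0 - level) * x + level * rng.gauss(0.0, 1.0)
                    for x in frame])
    return out


# Usage: a 5-frame clip built from one image latent, with the
# first and last frames kept clean and the middle frames noised.
clip = image_to_pseudo_video([1.0, 1.0, 1.0, 1.0], num_frames=5)
noised = framewise_noise(clip, noisy_frames=[1, 2, 3], sigma=0.8)
```

In this sketch the clean frames act as fixed anchors while the noised frames are left for a generative model to fill in. Whether LIVE selects frames or schedules noise this way is not stated in the text above.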
