COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing
Jiangshan Wang, Yue Ma, Jiayi Guo, Yicheng Xiao, Gao Huang, Xiu Li

TL;DR
COVE introduces a novel diffusion feature correspondence method for consistent, high-quality video editing that leverages inherent diffusion features, a sliding-window similarity strategy, and token merging to improve efficiency and temporal coherence without additional training.
Contribution
The paper presents a new diffusion feature correspondence approach for video editing, enabling temporal consistency and efficiency without extra training or optimization.
Findings
Achieves state-of-the-art performance in various video editing scenarios.
Outperforms existing methods both quantitatively and qualitatively.
Reduces GPU memory usage and accelerates the editing process.
Abstract
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video in a zero-shot manner. Despite extensive efforts, maintaining the temporal consistency of edited videos remains challenging due to the lack of temporal constraints in the regular T2I diffusion model. To address this issue, we propose COrrespondence-guided Video Editing (COVE), leveraging the inherent diffusion feature correspondence to achieve high-quality and consistent video editing. Specifically, we propose an efficient sliding-window-based strategy to calculate the similarity among tokens in the diffusion features of source videos, identifying the tokens with high correspondence across frames. During the inversion and denoising process, we sample the tokens in noisy latent based on the correspondence and then perform self-attention…
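The sliding-window correspondence search mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `cosine` and `track_tokens`, the nested-list feature layout, the use of cosine similarity, and the choice to centre each window on the previous frame's match and keep only the single best candidate are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)  # small epsilon avoids division by zero


def track_tokens(feats, window=3):
    """Follow every token of frame 0 through the later frames (hypothetical helper).

    feats[t][i][j] is the feature vector (list of floats) of the token at
    row i, column j of frame t. Returns traj, where traj[t][i][j] is the
    (row, col) position in frame t matched to token (i, j) of frame 0.
    """
    num_frames, height, width = len(feats), len(feats[0]), len(feats[0][0])
    radius = window // 2
    # In frame 0 every token corresponds to itself.
    traj = [[[(i, j) for j in range(width)] for i in range(height)]]
    for t in range(1, num_frames):
        frame_matches = []
        for i in range(height):
            row_matches = []
            for j in range(width):
                # Assumption: the window in frame t is centred on the match
                # found in frame t - 1, clipped at the frame borders.
                pi, pj = traj[t - 1][i][j]
                query = feats[t - 1][pi][pj]
                candidates = [
                    (a, b)
                    for a in range(max(0, pi - radius), min(height, pi + radius + 1))
                    for b in range(max(0, pj - radius), min(width, pj + radius + 1))
                ]
                # Keep the most similar candidate inside the window.
                best = max(
                    candidates,
                    key=lambda ab: cosine(query, feats[t][ab[0]][ab[1]]),
                )
                row_matches.append(best)
            frame_matches.append(row_matches)
        traj.append(frame_matches)
    return traj
```

In this sketch each token is compared against at most `window * window` candidates per frame rather than against every token in the next frame, which is where a windowed search saves computation over an exhaustive one. A token that moves farther than the window radius between two frames cannot be matched correctly, so the window size trades cost against the amount of motion that can be followed.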
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Video Coding and Compression Technologies
Methods: Sigmoid Activation · Tanh Activation · Location-based Attention · Long Short-Term Memory · Softmax · GloVe Embeddings · Sequence to Sequence · Diffusion · Bidirectional LSTM · Contextual Word Vectors
