TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Michal Geyer; Omer Bar-Tal; Shai Bagon; Tali Dekel

arXiv:2307.10373 · cs.CV · November 21, 2023 · 40 cites

TL;DR

TokenFlow introduces a novel video editing framework that leverages diffusion feature consistency to produce high-quality, text-driven edited videos without additional training, maintaining spatial and motion coherence.

Contribution

It presents a training-free method that enforces diffusion feature consistency for improved video editing quality and control using existing text-to-image diffusion models.

Findings

01 Achieves state-of-the-art video editing results

02 Maintains spatial layout and motion coherence

03 Does not require training or fine-tuning

Abstract

The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training…
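To illustrate the propagation idea described in the abstract, here is a minimal sketch, not the authors' implementation. It assumes that each frame's diffusion features are flattened into an array of shape (num_tokens, dim), that inter-frame correspondences are taken as nearest neighbors under cosine similarity in the source features, and that a single keyframe is used. The function names `nearest_neighbor_field` and `propagate` are hypothetical, and random arrays stand in for real features.

```python
import numpy as np


def nearest_neighbor_field(src_frame, src_key):
    """For each token in src_frame, return the index of the most similar
    token in src_key, using cosine similarity (an assumed metric)."""
    a = src_frame / np.linalg.norm(src_frame, axis=1, keepdims=True)
    b = src_key / np.linalg.norm(src_key, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)


def propagate(edited_key, nn_field):
    """Build features for a frame by copying edited keyframe tokens
    along the correspondences found in the source features."""
    return edited_key[nn_field]


# Toy data: the "frame" is a shuffled copy of the keyframe's tokens,
# so the correct correspondence is the shuffling permutation itself.
rng = np.random.default_rng(0)
src_key = rng.normal(size=(6, 8))      # source keyframe tokens
perm = rng.permutation(6)
src_frame = src_key[perm]              # source tokens of another frame
edited_key = rng.normal(size=(6, 8))   # stand-in for edited keyframe tokens

field = nearest_neighbor_field(src_frame, src_key)
edited_frame = propagate(edited_key, field)
```

In this toy setup the recovered field matches the permutation, so the edited frame reuses exactly the edited keyframe tokens that correspond to its source tokens. That reuse is what keeps the edit consistent across frames.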

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 4

Strengths

- The video editing results are impressive, the temporal consistency is pretty good.
- The analysis and visualization of UNet features on video tasks are helpful for future research on video generation.
- The idea of TokenFlow is novel. Based on the ablation study and qualitative results in the supplemental material, TokenFlow is also very critical to good temporal consistency.
- The paper reads well and is easy to follow.

Weaknesses

Although it is not necessary, it would be helpful to compare TokenFlow with Pix2Video.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

S1: Sensible model design. Although the ideas of 1) using text-to-image diffusion model for video generation and 2) using latent feature flow for temporal consistency are not new, the proposed framework combines these components sensibly. The simplicity of this method also makes it compatible with existing video editing methods and more efficient than most prior arts.

S2: Temporally consistent results. The visual results show a significant improvement from prior methods in terms of temporal consi

Weaknesses

W1: Novelty. The novelty of the proposed framework is slightly limited, considering that the key components (keyframe sampling, feature aggregation and propagation across frames) are introduced in prior works. Also, it is unclear which part of Section 4.1 is newly proposed in the paper and which is borrowed from other works. It would be great if the authors can elaborate on the main differences from prior methods and specify the novel components/modifications.

W2: Limited structural deviation

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

1. The proposed method is simple and lightweight. It is built on an existing diffusion-based image editing method and does not need to fine-tune the model. The "TokenFlow Editing" algorithm is easy to implement (code available in the supplementary material). It directly utilizes Stable Diffusion, DDIM inversion, and PnP-Diffusion, and the TokenFlow procedure just requires computing the nearest-neighbor fields for token feature maps. Further, as mentioned in the summary in Sec. 1 of the paper,

Weaknesses

1. The results in the paper and the supplementary material mainly demonstrate the visual effect of video style transfer. For more general video editing tasks, one might expect to see some results of motion-based or composition-based video editing. Since the proposed method relies on the feature correspondences in the original video, it seems not trivial if one would like to modify the TokenFlow for motion-based editing.
2. Regarding the quantitative evaluation:
   - The *edit fidelity* measured

Code & Models

Repositories

omerbt/tokenflow · pytorch

Videos

TokenFlow: Consistent Diffusion Features for Consistent Video Editing· slideslive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization

Methods: Diffusion