TokenFlow: Consistent Diffusion Features for Consistent Video Editing
Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel

TL;DR
TokenFlow introduces a novel video editing framework that leverages diffusion feature consistency to produce high-quality, text-driven edited videos without additional training, maintaining spatial and motion coherence.
Contribution
It presents a training-free method that enforces diffusion feature consistency for improved video editing quality and control using existing text-to-image diffusion models.
Findings
Achieves state-of-the-art video editing results
Maintains spatial layout and motion coherence
Does not require training or fine-tuning
Abstract
The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training…
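The propagation idea in the abstract can be illustrated with a small, simplified sketch. It is not the authors' implementation: it assumes a single keyframe, cosine similarity as the matching criterion, and tiny hand-written token vectors, and the names `nearest_neighbor_field` and `propagate` are hypothetical. Correspondences are computed on the source features and then reused to copy edited keyframe tokens into another frame.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nearest_neighbor_field(frame_tokens, key_tokens):
    """For each frame token, return the index of the most similar keyframe token."""
    return [
        max(range(len(key_tokens)), key=lambda j: cosine(t, key_tokens[j]))
        for t in frame_tokens
    ]


def propagate(edited_key_tokens, nn_field):
    """Build a frame's edited tokens by gathering edited keyframe tokens."""
    return [list(edited_key_tokens[j]) for j in nn_field]


# Toy source features (assumed values): three tokens per frame, two channels each.
src_key = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
src_frame = [[0.1, 0.9], [0.9, 0.8], [2.0, 0.1]]

# Edited keyframe tokens (assumed to come from an image-editing step).
edited_key = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

# Correspondences come from the SOURCE video, then are applied to EDITED tokens.
field = nearest_neighbor_field(src_frame, src_key)
edited_frame = propagate(edited_key, field)
print(field)         # [1, 2, 0]
print(edited_frame)  # [[0.0, 10.0], [5.0, 5.0], [10.0, 0.0]]
```

Because the correspondence field is derived from the original video, the edited frame inherits the source's layout and motion, which is the consistency property the abstract describes.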
Peer Reviews
Decision: ICLR 2024 poster
- The video editing results are impressive, the temporal consistency is pretty good.
- The analysis and visualization of UNet features on video tasks are helpful for future research on video generation.
- The idea of TokenFlow is novel. Based on the ablation study and qualitative results in the supplemental material, TokenFlow is also very critical to good temporal consistency.
- The paper reads well and is easy to follow.
Although it is not necessary, it would be helpful to compare TokenFlow with Pix2Video.
S1: Sensible model design
Although the ideas of 1) using a text-to-image diffusion model for video generation and 2) using latent feature flow for temporal consistency are not new, the proposed framework combines these components sensibly. The simplicity of this method also makes it compatible with existing video editing methods and more efficient than most prior art.
S2: Temporally consistent results
The visual results show a significant improvement from prior methods in terms of temporal consi
W1: Novelty
The novelty of the proposed framework is slightly limited, considering that the key components (keyframe sampling, feature aggregation and propagation across frames) are introduced in prior works. Also, it is unclear which part of Section 4.1 is newly proposed in the paper and which is borrowed from other works. It would be great if the authors can elaborate on the main differences from prior methods and specify the novel components/modifications.
W2: Limited structural deviation
1. The proposed method is simple and lightweight. It is built on an existing diffusion-based image editing method and does not need to fine-tune the model. The "TokenFlow Editing" algorithm is easy to implement (code available in the supplementary material). It directly utilizes Stable Diffusion, DDIM inversion, and PnP-Diffusion, and the TokenFlow procedure just requires computing the nearest-neighbor fields for token feature maps. Further, as mentioned in the summary in Sec. 1 of the paper,
1. The results in the paper and the supplementary material mainly demonstrate the visual effect of video style transfer. For more general video editing tasks, one might expect to see some results of motion-based or composition-based video editing. Since the proposed method relies on the feature correspondences in the original video, it seems nontrivial to adapt TokenFlow to motion-based editing.
2. Regarding the quantitative evaluation:
- The *edit fidelity* measured
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
Methods: Diffusion
