Edit Temporal-Consistent Videos with Image Diffusion Model
Yuanzhi Wang, Yong Li, Xiaoya Zhang, Xin Liu, Anbo Dai, Antoni B. Chan, Zhen Cui

TL;DR
This paper introduces TCVE, a novel method that combines spatial and temporal Unets to improve temporal consistency in text-guided video editing, achieving state-of-the-art results.
Contribution
The paper proposes a new temporal Unet architecture and a spatial-temporal modeling unit to enhance temporal coherence in video editing using diffusion models.
Findings
TCVE outperforms existing methods in temporal consistency.
The approach maintains high-quality content manipulation.
Quantitative results show state-of-the-art performance.
Abstract
Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing, yielding impressive zero-shot video editing performance. Nonetheless, the generated videos usually show spatial irregularities and temporal inconsistencies as the temporal characteristics of videos have not been faithfully modeled. In this paper, we propose an elegant yet effective Temporal-Consistent Video Editing (TCVE) method to mitigate the temporal inconsistency challenge for robust text-guided video editing. In addition to the utilization of a pretrained T2I 2D Unet for spatial content manipulation, we establish a dedicated temporal Unet architecture to faithfully capture the temporal coherence of the input video sequences. Furthermore, to establish coherence and interrelation between the spatial-focused and temporal-focused components, a cohesive spatial-temporal modeling unit is…
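The split between spatial and temporal processing described in the abstract can be illustrated with a toy sketch. This is not the TCVE architecture: the per-frame normalisation, the moving average over frames, and the weighted sum in `fuse` are simple stand-ins chosen for illustration, and all function names and the `alpha` parameter are hypothetical. The sketch only shows which tensor axes each kind of branch operates on, assuming a video stored as an array of shape (frames, channels, height, width).

```python
import numpy as np


def spatial_branch(video: np.ndarray) -> np.ndarray:
    """Stand-in for a 2D (per-frame) model: each frame is processed alone.

    Here every frame is normalised with its own mean and standard deviation,
    so no information flows between frames.
    """
    mean = video.mean(axis=(1, 2, 3), keepdims=True)
    std = video.std(axis=(1, 2, 3), keepdims=True) + 1e-6
    return (video - mean) / std


def temporal_branch(video: np.ndarray, window: int = 3) -> np.ndarray:
    """Stand-in for a temporal model: each pixel is processed along the frame axis.

    A moving average over `window` neighbouring frames (odd window assumed),
    with edge padding so the number of frames is preserved.
    """
    pad = window // 2
    padded = np.pad(video, ((pad, pad), (0, 0), (0, 0), (0, 0)), mode="edge")
    frames = [padded[i:i + window].mean(axis=0) for i in range(video.shape[0])]
    return np.stack(frames)


def fuse(spatial: np.ndarray, temporal: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Hypothetical placeholder for a spatial-temporal fusion step (weighted sum)."""
    return alpha * spatial + (1.0 - alpha) * temporal


rng = np.random.default_rng(0)
video = rng.normal(size=(8, 3, 16, 16))  # (frames, channels, height, width)
fused = fuse(spatial_branch(video), temporal_branch(video))
```

The point of the sketch is the axis convention: the spatial stand-in reduces over channel, height, and width within one frame, while the temporal stand-in reduces over neighbouring frames at a fixed pixel. How TCVE actually couples the two branches is not specified in the truncated abstract above.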
Taxonomy
Topics: Advanced Vision and Imaging
Methods: Diffusion
