MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance
Ernie Chu, Tzuhsuan Huang, Shuo-Yen Lin, Jun-Cheng Chen

TL;DR
MeDM uses pre-trained image diffusion models for efficient, temporally consistent video-to-video translation and editing, with no fine-tuning or test-time optimization.
Contribution
MeDM introduces a framework that uses optical flow to impose physical constraints on generated frames, which enforces temporal consistency. It works with existing image diffusion models and requires no additional training.
Findings
Achieves high-quality, temporally consistent video translation.
Outperforms existing methods on various benchmarks.
Enables text-guided video editing without fine-tuning.
Abstract
This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. By leveraging this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also suggest a workaround for modifying observation-space scores in latent Diffusion Models. Notably, MeDM does not require fine-tuning or test-time optimization of the Diffusion…
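The closed-form idea mentioned in the abstract can be illustrated with a small sketch. It assumes (the abstract does not spell this out) that the optical flows have already been turned into an integer id per pixel, where pixels sharing an id are taken to depict the same scene point. Under that assumption, the least-squares way to make all linked pixels agree is to replace each of them with their group mean. The names `mediate_frames` and `group_ids` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def mediate_frames(frames, group_ids):
    """Replace every pixel with the mean of its correspondence group.

    frames:    (T, H, W, C) array of independently generated frames.
    group_ids: (T, H, W) integer array (hypothetical input); pixels with
               the same id are assumed to show the same scene point.
    """
    frames = np.asarray(frames, dtype=np.float64)
    ids = np.asarray(group_ids).ravel()
    channels = frames.shape[-1]
    flat = frames.reshape(-1, channels)

    n_groups = int(ids.max()) + 1
    # Accumulate per-group channel sums; np.add.at handles repeated ids.
    sums = np.zeros((n_groups, channels))
    np.add.at(sums, ids, flat)
    counts = np.bincount(ids, minlength=n_groups).astype(np.float64)

    # The mean minimizes the summed squared distance to the group members.
    means = sums / np.maximum(counts, 1.0)[:, None]
    return means[ids].reshape(frames.shape)


# Two 1x2 single-channel frames; the scene shifts by one pixel between them,
# so the right pixel of frame 0 and the left pixel of frame 1 share id 1.
frames = np.array([[[[0.0], [2.0]]], [[[4.0], [6.0]]]])
group_ids = np.array([[[0, 1]], [[1, 2]]])
mediated = mediate_frames(frames, group_ids)
```

In this toy case the two linked pixels (values 2 and 4) both become 3, while unlinked pixels are unchanged. A real pipeline would also need to build the ids from flow and handle occlusions, which this sketch leaves out.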
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Cancer-related molecular mechanisms research
Methods: Diffusion
