JVID: Joint Video-Image Diffusion for Visual-Quality and Temporal-Consistency in Video Generation
Hadrien Reynaud, Matthew Baugh, Mischa Dombrowski, Sarah Cechnicka, Qingjie Meng, Bernhard Kainz

TL;DR
JVID combines a latent image diffusion model and a latent video diffusion model in the reverse diffusion process: the image model improves per-frame quality, and the video model keeps frames temporally consistent.
Contribution
The paper presents a joint diffusion framework that uses a Latent Image Diffusion Model (LIDM) and a Latent Video Diffusion Model (LVDM) together during sampling to improve both visual quality and temporal consistency.
Findings
Enhanced video realism and coherence demonstrated.
Quantitative improvements over existing methods.
Qualitative analysis confirms better temporal stability.
Abstract
We introduce the Joint Video-Image Diffusion model (JVID), a novel approach to generating high-quality and temporally coherent videos. We achieve this by integrating two diffusion models: a Latent Image Diffusion Model (LIDM) trained on images and a Latent Video Diffusion Model (LVDM) trained on video data. Our method combines these models in the reverse diffusion process, where the LIDM enhances image quality and the LVDM ensures temporal consistency. This unique combination allows us to effectively handle the complex spatio-temporal dynamics in video generation. Our results demonstrate quantitative and qualitative improvements in producing realistic and coherent videos.
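The idea of combining two models in one reverse diffusion process can be illustrated with a toy sketch. The abstract does not say how the LIDM and LVDM are mixed, so the per-step random choice below (controlled by a hypothetical `p_video` parameter) is an assumption made for illustration. The two "denoisers" are simple NumPy stand-ins rather than trained networks, and all function names are hypothetical.

```python
import numpy as np


def image_denoiser(x, t):
    """Hypothetical stand-in for the LIDM: processes every frame independently."""
    # x has shape (frames, channels, height, width).
    # A real model would predict noise; here we only shrink values toward zero.
    return 0.9 * x


def video_denoiser(x, t):
    """Hypothetical stand-in for the LVDM: shares information across frames."""
    # Pull each frame halfway toward the temporal mean, then shrink,
    # which mimics enforcing consistency along the frame axis.
    temporal_mean = x.mean(axis=0, keepdims=True)
    return 0.9 * (0.5 * x + 0.5 * temporal_mean)


def joint_reverse_process(shape, num_steps=50, p_video=0.5, seed=0):
    """Run a toy reverse process that picks one of the two models at each step.

    ASSUMPTION: the per-step Bernoulli choice with probability `p_video`
    is illustrative only; the abstract does not specify the mixing rule.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # start from Gaussian noise in latent space
    used = []
    for t in reversed(range(num_steps)):
        use_video = rng.random() < p_video
        step = video_denoiser if use_video else image_denoiser
        x = step(x, t)  # both models act on the same shared latent
        used.append("video" if use_video else "image")
    return x, used


if __name__ == "__main__":
    latents, schedule = joint_reverse_process((4, 3, 8, 8), num_steps=10)
    print(latents.shape, schedule)
```

The point of the sketch is that both models operate on the same latent tensor, so each step can draw on either the per-frame prior or the temporal prior. In a real sampler, each stand-in would be replaced by a trained network and a proper scheduler update.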
Taxonomy
Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Advanced Image Processing Techniques
Methods: Diffusion
