Zero-Shot Video Restoration and Enhancement with Assistance of Video Diffusion Models
Cong Cao, Huanjing Yue, Shangbin Xie, Xin Liu, Jingyu Yang

TL;DR
This paper introduces a training-free framework that uses video diffusion models to assist image-based zero-shot video restoration and enhancement, reducing temporal flickering and improving temporal consistency.
Contribution
It presents the first framework combining image-based methods with video diffusion models for zero-shot video restoration, introducing fusion strategies and post-processing for better temporal coherence.
Findings
Outperforms existing methods in temporal consistency.
Effective fusion strategies improve restoration quality.
Training-free approach applicable to various diffusion models.
Abstract
Although diffusion-based zero-shot image restoration and enhancement methods have achieved great success, applying them to video restoration or enhancement will lead to severe temporal flickering. In this paper, we propose the first framework that utilizes the rapidly-developed video diffusion model to assist the image-based method in maintaining more temporal consistency for zero-shot video restoration and enhancement. We propose homologous latents fusion, heterogeneous latents fusion, and a COT-based fusion ratio strategy to utilize both homologous and heterogeneous text-to-video diffusion models to complement the image method. Moreover, we propose temporal-strengthening post-processing to utilize the image-to-video diffusion model to further improve temporal consistency. Our method is training-free and can be applied to any diffusion-based image restoration and enhancement methods.…
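The latents fusion named in the abstract can be pictured as blending, at a given denoising step, the latents produced by the image-based method with those produced by a video diffusion model. The sketch below is only an illustration under stated assumptions: the convex blend, the function name `fuse_latents`, the `ratio` parameter, and the `(frames, channels, height, width)` layout are all hypothetical and are not taken from the paper.

```python
import numpy as np


def fuse_latents(z_image: np.ndarray, z_video: np.ndarray, ratio: float) -> np.ndarray:
    """Blend image-method latents with video-model latents.

    ratio = 0 keeps the image latents, ratio = 1 keeps the video latents.
    The convex blend is an assumption for illustration, not the paper's rule.
    """
    if z_image.shape != z_video.shape:
        # Latents from different VAEs would first need to be mapped
        # into a shared latent space before blending.
        raise ValueError("latents must have the same shape")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    return (1.0 - ratio) * z_image + ratio * z_video


rng = np.random.default_rng(0)
# Hypothetical latent layout: (frames, channels, height, width).
z_img = rng.standard_normal((8, 4, 32, 32))
z_vid = rng.standard_normal((8, 4, 32, 32))
z_fused = fuse_latents(z_img, z_vid, ratio=0.3)
```

A small ratio keeps the result close to the image-based restoration, while a larger ratio leans more on the video prior; how that ratio is actually chosen is what the fusion ratio strategy addresses.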
Peer Reviews
Decision · Submitted to ICLR 2026
1. The method is unsupervised and training-free, leveraging existing pre-trained models, which makes it practical.
2. The method generally shows improvements compared to the baseline or other compared methods.
1. I think the method section could be much better presented and many details should be introduced. I struggled to fully understand the method. How do you get the noise $z_T$ in line 201? Do you invert the degraded video? How do you use the T2V model to generate a video similar to the input one? Which text prompt are you using? In line 182, you mentioned that your input is only a video.
2. Based on my understanding, I think the performance depends heavily on the hyperparameters used for late
(1) I think the goal is practical: using actual video priors to fix temporal instability in zero-shot diffusion IR.
(2) The heterogeneous latent bridge (2D↔3D VAE encode/decode to align latents) is a straightforward engineering workaround that makes modern T2V usable.
(3) The pipeline is training-free w.r.t. new networks and slots into several zero-shot IR backbones.
(4) Results show lower WE/FVD and perceptual gains over PSLD-only variants; ablations indicate each block helps.
(1) “First framework” is oversold. I think the novelty is thinner than claimed. Homologous fusion is basically FVDM-style latent mixing applied to restoration rather than editing; heterogeneous fusion is a vanilla encode–decode bridge between VAEs; and the “CoT-based” strategy is just a best-of-N hyperparameter search per timestep with two off-the-shelf metrics. Slapping “CoT” on a verifier-guided grid search doesn’t make it reasoning-based. The pitch feels buzzwordy rather than conceptually new
- The proposed zero-shot setting is novel and practically motivated.
- Integrating diffusion priors with temporal alignment is conceptually elegant.
- The method demonstrates versatility across multiple restoration tasks.
- Lacks comparison with recent diffusion-based video restoration models such as **Upscale-A-Video (CVPR 2024)** and **SeedVR (CVPR 2025)**, making it hard to gauge true competitiveness.
- No runtime, peak memory, or parameter analysis is provided, which limits understanding of efficiency and scalability.
- Temporal consistency evaluation is weak, reporting only **Warping Error (WE)** without metrics like **DOVER** or **tLPIPS**, which better reflect human-perceived temporal coherence and det
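One review above characterizes the CoT-based fusion ratio strategy as a best-of-N search over candidate ratios at each timestep, scored by two off-the-shelf metrics. A minimal sketch of that reading follows. The two scoring functions are simple stand-ins (adjacent-frame difference and squared error to the image-method latents) chosen only so the example is self-contained; the metrics actually used are not named in the text here, and `select_ratio`, `candidates`, and `weight` are hypothetical names.

```python
import numpy as np


def temporal_score(frames: np.ndarray) -> float:
    # Stand-in smoothness metric: negative mean absolute difference
    # between adjacent frames (higher means less frame-to-frame change).
    return -float(np.mean(np.abs(np.diff(frames, axis=0))))


def fidelity_score(frames: np.ndarray, reference: np.ndarray) -> float:
    # Stand-in fidelity metric: negative mean squared error to a reference.
    return -float(np.mean((frames - reference) ** 2))


def select_ratio(z_image, z_video, candidates, weight=0.5):
    """Try each candidate fusion ratio and keep the best-scoring one."""
    best_ratio, best_score = None, -np.inf
    for r in candidates:
        fused = (1.0 - r) * z_image + r * z_video  # assumed convex blend
        score = (weight * temporal_score(fused)
                 + (1.0 - weight) * fidelity_score(fused, z_image))
        if score > best_score:
            best_ratio, best_score = r, score
    return best_ratio, best_score


rng = np.random.default_rng(0)
# Flickering "image-method" latents and temporally constant "video" latents.
z_img = rng.standard_normal((6, 4, 8, 8))
z_vid = np.tile(rng.standard_normal((1, 4, 8, 8)), (6, 1, 1, 1))
ratio_smooth, _ = select_ratio(z_img, z_vid, [0.0, 0.5, 1.0], weight=1.0)
ratio_faithful, _ = select_ratio(z_img, z_vid, [0.0, 0.5, 1.0], weight=0.0)
```

With all weight on the temporal stand-in, the search picks the ratio that leans fully on the constant video latents; with all weight on fidelity, it keeps the image latents. In a denoising loop, a search of this kind would be repeated at each timestep, which is why the reviews ask about runtime and sensitivity to hyperparameters.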
Taxonomy
Topics: Advanced Image Processing Techniques · Image and Video Quality Assessment · Image Enhancement Techniques
