Zero-Shot Video Restoration and Enhancement with Assistance of Video Diffusion Models

Cong Cao; Huanjing Yue; Shangbin Xie; Xin Liu; Jingyu Yang

arXiv:2601.21922·cs.CV·January 30, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a training-free framework that uses video diffusion models to assist image-based zero-shot restoration and enhancement methods when they are applied to video, reducing temporal flickering and improving temporal consistency.

Contribution

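The authors present what they describe as the first framework that combines image-based zero-shot restoration methods with video diffusion models. It introduces homologous and heterogeneous latents fusion, a CoT-based fusion ratio strategy, and temporal-strengthening post-processing to improve temporal coherence.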

Findings

01. Outperforms existing methods in temporal consistency.

02. Effective fusion strategies improve restoration quality.

03. Training-free approach applicable to various diffusion models.

Abstract

Although diffusion-based zero-shot image restoration and enhancement methods have achieved great success, applying them to video restoration or enhancement will lead to severe temporal flickering. In this paper, we propose the first framework that utilizes the rapidly-developed video diffusion model to assist the image-based method in maintaining more temporal consistency for zero-shot video restoration and enhancement. We propose homologous latents fusion, heterogenous latents fusion, and a COT-based fusion ratio strategy to utilize both homologous and heterogenous text-to-video diffusion models to complement the image method. Moreover, we propose temporal-strengthening post-processing to utilize the image-to-video diffusion model to further improve temporal consistency. Our method is training-free and can be applied to any diffusion-based image restoration and enhancement methods.…
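The latents fusion and fusion ratio ideas named in the abstract can be illustrated with a toy sketch. The abstract does not give the fusion formula, so the linear blend below is an assumption, and the best-of-N ratio selection follows how one of the reviews below characterizes the ratio strategy rather than the paper's own description. The names `fuse_latents`, `pick_ratio`, and `score` are hypothetical, and plain Python lists stand in for latent tensors.

```python
from typing import Callable, List, Sequence, Tuple

Latent = List[float]  # toy stand-in for a latent tensor


def fuse_latents(z_image: Latent, z_video: Latent, ratio: float) -> Latent:
    """Linearly blend two latents (assumed form, not taken from the paper).

    ratio = 0 keeps the image-method latent, ratio = 1 keeps the
    video-model latent.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if len(z_image) != len(z_video):
        raise ValueError("latents must have the same length")
    return [(1.0 - ratio) * a + ratio * b for a, b in zip(z_image, z_video)]


def pick_ratio(
    z_image: Latent,
    z_video: Latent,
    candidates: Sequence[float],
    score: Callable[[Latent], float],
) -> Tuple[float, Latent]:
    """Best-of-N search: try each candidate ratio, keep the highest score.

    `score` is a hypothetical verifier (higher is better); in practice it
    might combine a quality metric and a temporal-consistency metric.
    """
    if not candidates:
        raise ValueError("need at least one candidate ratio")
    best_ratio = candidates[0]
    best_latent = fuse_latents(z_image, z_video, best_ratio)
    best_score = score(best_latent)
    for r in candidates[1:]:
        fused = fuse_latents(z_image, z_video, r)
        s = score(fused)
        if s > best_score:
            best_ratio, best_latent, best_score = r, fused, s
    return best_ratio, best_latent
```

In a real pipeline this selection would presumably run at each denoising timestep, with the verifier scoring decoded frames rather than raw latent values.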

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. The method is unsupervised and training-free, leveraging existing pre-trained models, which makes it practical.
2. The method generally shows improvements compared to the baseline or other compared methods.

Weaknesses

1. I think the method section could be much better presented and many details should be introduced. I struggled to fully understand the method. How do you get the noise $z_T$ in line 201? Do you invert the degraded video? How do you use the T2V model to generate a video similar to the input one? Which text prompt are you using? In line 182, you mentioned that your input is only a video.
2. Based on my understanding, I think the performance depends heavily on the hyperparameters used for late

Reviewer 02 · Rating 4 · Confidence 4

Strengths

(1) I think the goal is practical: using actual video priors to fix temporal instability in zero-shot diffusion IR.
(2) The heterogeneous latent bridge (2D↔3D VAE encode/decode to align latents) is a straightforward engineering workaround that makes modern T2V usable.
(3) The pipeline is training-free w.r.t. new networks and slots into several zero-shot IR backbones.
(4) Results show lower WE/FVD and perceptual gains over PSLD-only variants; ablations indicate each block helps.

Weaknesses

(1) “First framework” is oversold. I think the novelty is thinner than claimed. Homologous fusion is basically FVDM-style latent mixing applied to restoration rather than editing; heterogeneous fusion is a vanilla encode–decode bridge between VAEs; and the “CoT-based” strategy is just a best-of-N hyperparameter search per timestep with two off-the-shelf metrics. Slapping “CoT” on a verifier-guided grid search doesn’t make it reasoning-based. The pitch feels buzzwordy rather than conceptually new

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The proposed zero-shot setting is novel and practically motivated.
- Integrating diffusion priors with temporal alignment is conceptually elegant.
- The method demonstrates versatility across multiple restoration tasks.

Weaknesses

- Lacks comparison with recent diffusion-based video restoration models such as **Upscale-A-Video (CVPR 2024)** and **SeedVR (CVPR 2025)**, making it hard to gauge true competitiveness.
- No runtime, peak memory, or parameter analysis is provided, which limits understanding of efficiency and scalability.
- Temporal consistency evaluation is weak, reporting only **Warping Error (WE)** without metrics like **DOVER** or **tLPIPS**, which better reflect human-perceived temporal coherence and det

Taxonomy

Topics: Advanced Image Processing Techniques · Image and Video Quality Assessment · Image Enhancement Techniques