DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models

Chang-Han Yeh; Hau-Shiang Shiu; Chin-Yang Lin; Zhixiang Wang; Chi-Wei Hsiao; Ting-Hsuan Chen; Yu-Lun Liu

arXiv:2407.01519·cs.CV·January 1, 2026·1 cites

TL;DR

DiffIR2VR-Zero is a versatile zero-shot video restoration framework that leverages pre-trained image diffusion models, ensuring high-quality, temporally consistent results across various degradation scenarios without additional training.

Contribution

It introduces a hierarchical latent warping strategy and a hybrid token merging mechanism, which together let existing image diffusion models restore video without retraining.
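To make the idea of latent warping concrete, here is a minimal sketch, not the authors' implementation. It assumes a single-channel latent stored as nested lists, a dense flow field giving per-pixel `(dx, dy)` offsets into a reference frame, nearest-neighbour sampling, and a fixed blend weight; the names `warp_latent` and `blend_with_reference` are hypothetical. In a hierarchical scheme, the same operation could plausibly be applied at two levels, once between keyframes and once between neighbouring frames inside a local window.

```python
from typing import List, Tuple

Grid = List[List[float]]                # H x W latent, one channel (illustrative)
Flow = List[List[Tuple[float, float]]]  # per-pixel (dx, dy) into the reference frame


def warp_latent(reference: Grid, flow: Flow) -> Tuple[Grid, List[List[bool]]]:
    """Backward-warp `reference` into the current frame (nearest neighbour).

    Returns the warped latent and a mask that is False wherever the flow
    points outside the reference frame.
    """
    height, width = len(reference), len(reference[0])
    warped = [[0.0] * width for _ in range(height)]
    valid = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            src_x, src_y = round(x + dx), round(y + dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = reference[src_y][src_x]
                valid[y][x] = True
    return warped, valid


def blend_with_reference(
    current: Grid, warped: Grid, valid: List[List[bool]], weight: float = 0.5
) -> Grid:
    """Pull the current latent toward the warped reference where flow is valid."""
    return [
        [
            (1 - weight) * cur + weight * ref if ok else cur
            for cur, ref, ok in zip(cur_row, ref_row, ok_row)
        ]
        for cur_row, ref_row, ok_row in zip(current, warped, valid)
    ]


reference = [[1.0, 2.0], [3.0, 4.0]]
flow = [[(1, 0), (1, 0)], [(0, 0), (0, -1)]]
warped, valid = warp_latent(reference, flow)
blended = blend_with_reference([[0.0, 0.0], [0.0, 0.0]], warped, valid)
```

Pixels whose flow leaves the frame keep their own latent value, which is one simple way to avoid propagating content that has no correspondence.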

Findings

01 Achieves superior temporal consistency across diverse datasets.

02 Handles challenging scenarios such as super-resolution and severe noise.

03 Works with any pre-trained image restoration diffusion model without modifications.

Abstract

We present DiffIR2VR-Zero, a zero-shot framework that enables any pre-trained image restoration diffusion model to perform high-quality video restoration without additional training. While image diffusion models have shown remarkable restoration capabilities, their direct application to video leads to temporal inconsistencies, and existing video restoration methods require extensive retraining for different degradation types. Our approach addresses these challenges through two key innovations: a hierarchical latent warping strategy that maintains consistency across both keyframes and local frames, and a hybrid token merging mechanism that adaptively combines optical flow and feature matching. Through extensive experiments, we demonstrate that our method not only maintains the high-quality restoration of base diffusion models but also achieves superior temporal consistency across diverse…
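The hybrid matching idea in the abstract, combining optical flow with feature matching, can be illustrated with a small sketch. This is an assumption-laden illustration rather than the paper's algorithm: it supposes each source token comes with a flow-proposed match and a boolean confidence flag (for example from a forward-backward consistency check), falls back to cosine-similarity matching when the flow is not trusted, and merges matched tokens by simple averaging. The function names and the `sim_threshold` parameter are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two token vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def hybrid_match(
    src_tokens: List[List[float]],
    dst_tokens: List[List[float]],
    flow_match: List[Optional[int]],
    flow_confident: List[bool],
    sim_threshold: float = 0.9,
) -> List[Optional[int]]:
    """Pick a destination index per source token: flow first, features second."""
    matches: List[Optional[int]] = []
    for i, token in enumerate(src_tokens):
        if flow_confident[i] and flow_match[i] is not None:
            matches.append(flow_match[i])  # trust the flow correspondence
            continue
        # Fallback: most similar destination token by cosine similarity.
        best_index, best_sim = None, -1.0
        for j, candidate in enumerate(dst_tokens):
            sim = cosine(token, candidate)
            if sim > best_sim:
                best_index, best_sim = j, sim
        matches.append(best_index if best_sim >= sim_threshold else None)
    return matches


def merge_tokens(
    src_tokens: List[List[float]],
    dst_tokens: List[List[float]],
    matches: List[Optional[int]],
) -> List[List[float]]:
    """Average each destination token with the source tokens matched to it."""
    sums = [list(t) for t in dst_tokens]
    counts = [1] * len(dst_tokens)
    for token, j in zip(src_tokens, matches):
        if j is None:
            continue  # unmatched tokens are left out of the merge
        sums[j] = [s + v for s, v in zip(sums[j], token)]
        counts[j] += 1
    return [[s / n for s in vec] for vec, n in zip(sums, counts)]


src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dst = [[1.0, 0.0], [0.0, 2.0]]
matches = hybrid_match(src, dst, [0, 0, None], [True, False, False])
merged = merge_tokens(src, dst, matches)
```

In this toy run the first token follows its flow match, the second ignores an untrusted flow match and uses feature similarity instead, and the third stays unmerged because no candidate clears the threshold.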

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. The method is training-free which makes it computationally practical.

Weaknesses

1. To my understanding, the method relies on certain architectural heuristics inspired by the video-editing literature rather than on a solid mathematical framework. Since the method involves no training, there is no theoretical guarantee that it won't fail under certain conditions.
2. The method combines existing ideas from video editing, which in my opinion is still acceptable; however, this limits its novelty.
3. The improvement over the baseline on some datasets, particularly in the case o

Reviewer 02 · Rating 5 · Confidence 3

Strengths

- The paper is well-structured, making the methodology and findings easy to understand.
- The method achieves competitive results without requiring additional training.

Weaknesses

- Both the latent warping and hybrid flow-guided token merging approaches rely heavily on optical flow information, which could be a limitation in cases where optical flow estimation is challenging or inaccurate.
- The paper's novelty is somewhat limited, as it primarily combines two existing methodologies with minor modifications. Specifically, the contributions include adjusting the range of warping frames at global and local levels and introducing a flow-guided confidence criterion for token

Reviewer 03 · Rating 5 · Confidence 4

Strengths

The primary contribution of this approach is its ability to leverage conventional image generation models directly, without requiring modifications to network architecture or the need for retraining or fine-tuning. This is achieved through a straightforward yet effective technique: hierarchical token merging within the latent space, which ensures temporal consistency across generated video frames. By using this token merging strategy, the method successfully adapts static image models for dynami

Weaknesses

First, the paper’s structure requires improvement, as the current organization makes it challenging to follow. A major concern is the lack of a comprehensive comparison between this approach and conventional methods, such as VidToMe and Upscale-A-Video. VidToMe introduces local and global token merging techniques, while Upscale-A-Video presents a flow-based merging approach—both key contributions relevant to this work. Please clarify how this approach differentiates itself from these methods, a

Code & Models

Repositories

jimmycv07/DiffIR2VR-Zero
PyTorch · Official

Taxonomy

Topics: Advanced Image Processing Techniques · Medical Imaging Techniques and Applications · Image and Signal Denoising Methods

Methods: Diffusion