Solving Video Inverse Problems Using Image Diffusion Models
Taesung Kwon, Jong Chul Ye

TL;DR
This paper introduces a novel video inverse problem solver that leverages image diffusion models by treating video frames as batch data, enabling effective spatio-temporal restoration without training dedicated video diffusion models.
Contribution
The method innovatively applies image diffusion models to videos by treating the temporal dimension as batch data and introduces a batch-consistent diffusion sampling strategy for improved spatio-temporal consistency.
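The summary above does not spell out how batch consistency is enforced. One plausible reading, assumed here purely for illustration, is that the stochastic noise injected during sampling is shared across all frames in the batch rather than drawn independently per frame. The function names below are hypothetical and do not come from the paper's code.

```python
import numpy as np

def independent_noise(rng, num_frames, frame_shape):
    """Baseline: a separate Gaussian noise map for every frame."""
    return rng.standard_normal((num_frames, *frame_shape))

def batch_consistent_noise(rng, num_frames, frame_shape):
    """Assumed variant: draw one noise map and repeat it along the frame (batch) axis."""
    single = rng.standard_normal((1, *frame_shape))
    return np.repeat(single, num_frames, axis=0)

rng = np.random.default_rng(0)
# Four RGB frames of size 8x8, stacked along the batch axis.
shared = batch_consistent_noise(rng, num_frames=4, frame_shape=(3, 8, 8))
separate = independent_noise(rng, num_frames=4, frame_shape=(3, 8, 8))
```

With shared noise, every frame starts from the same random perturbation, so differences between frames come only from the measurements and the denoiser, not from the sampler's randomness.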
Findings
Achieves state-of-the-art results on various video inverse problems.
Effectively restores videos with complex spatio-temporal degradations.
Demonstrates efficiency by avoiding training dedicated video diffusion models.
Abstract
Recently, diffusion model-based inverse problem solvers (DIS) have emerged as state-of-the-art approaches for addressing inverse problems, including image super-resolution, deblurring, inpainting, etc. However, their application to video inverse problems arising from spatio-temporal degradation remains largely unexplored due to the challenges in training video diffusion models. To address this issue, here we introduce an innovative video inverse solver that leverages only image diffusion models. Specifically, by drawing inspiration from the success of the recent decomposed diffusion sampler (DDS), our method treats the time dimension of a video as the batch dimension of image diffusion models and solves spatio-temporal optimization problems within denoised spatio-temporal batches derived from each image diffusion model. Moreover, we introduce a batch-consistent diffusion sampling…
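The idea of treating time as the batch dimension and then solving an optimization problem on the denoised batch can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the degradation (a circular 3-frame temporal average), the regularized least-squares objective, and the noisy stand-in for the image model's noise prediction are all hypothetical choices made so that the example runs on its own.

```python
import numpy as np

def temporal_blur(video):
    """Hypothetical degradation A: circular 3-frame average along time (axis 0).
    With circular boundaries this operator is self-adjoint, so A^T = A."""
    return (np.roll(video, 1, axis=0) + video + np.roll(video, -1, axis=0)) / 3.0

def tweedie_denoise(x_t, eps_hat, alpha_bar):
    """Tweedie-style estimate of the clean batch from a noisy batch x_t."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)

def cg_refine(x0_hat, y, lam=0.1, num_steps=5):
    """A few conjugate gradient steps on (A^T A + lam I) x = A^T y + lam x0_hat,
    i.e. min ||y - A x||^2 + lam ||x - x0_hat||^2, started at x0_hat."""
    def normal_op(v):
        return temporal_blur(temporal_blur(v)) + lam * v

    rhs = temporal_blur(y) + lam * x0_hat
    x = x0_hat.copy()
    r = rhs - normal_op(x)
    p = r.copy()
    rs_old = np.sum(r * r)
    for _ in range(num_steps):
        if rs_old < 1e-20:  # already converged
            break
        Ap = normal_op(p)
        step = rs_old / np.sum(p * Ap)
        x = x + step * p
        r = r - step * Ap
        rs_new = np.sum(r * r)
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
clean = rng.standard_normal((8, 16, 16))  # 8 frames stacked along the batch axis
y = temporal_blur(clean)                  # spatio-temporally degraded measurement

alpha_bar = 0.5
eps = rng.standard_normal(clean.shape)
x_t = np.sqrt(alpha_bar) * clean + np.sqrt(1.0 - alpha_bar) * eps
# Stand-in for an image diffusion model applied frame by frame: an imperfect noise estimate.
eps_hat = eps + 0.3 * rng.standard_normal(clean.shape)

x0_hat = tweedie_denoise(x_t, eps_hat, alpha_bar)
x_refined = cg_refine(x0_hat, y)

residual_before = np.linalg.norm(temporal_blur(x0_hat) - y)
residual_after = np.linalg.norm(temporal_blur(x_refined) - y)
```

The denoising step sees each frame independently, as an image model would, while the conjugate gradient step couples the frames through the temporal operator. That coupling is what lets an image-only prior address a degradation that spans time.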
Peer Reviews
Decision: ICLR 2025 Poster
1. This paper offers an innovative approach to video inverse problems by leveraging pre-trained image diffusion models for video tasks, eliminating the need for computationally intensive video diffusion model training. 2. The method is computationally efficient, with VRAM savings that make it feasible for deployment in lower-resource environments.
1. The paper lacks comparison with recent state-of-the-art video processing methods beyond diffusion-based approaches, which would provide a more comprehensive benchmark. 2. The evaluation is limited to the DAVIS dataset, potentially restricting insights into the model’s performance on a broader range of video types and characteristics, such as high-frame-rate and highly dynamic videos. 3. The degradations used in the experiments are synthesized through point spread functions (PSF), which may no
1. The paper utilizes image diffusion models to reduce the need for extensive video model training, leading to faster processing times. 2. The paper proposes a batch-consistent sampling strategy that ensures coherent frame generation, enhancing the visual quality of reconstructed videos. 3. The paper conducts experiments on various video inverse problems, such as deblurring and super-resolution, with improved reconstruction accuracy.
1. The ablation experiments in this paper could be more comprehensive. For instance, it remains unclear whether the choice of pre-trained image model affects the final results and whether the proposed algorithm yields consistent conclusions across various pre-trained models. 2. The paper lacks testing on more complex real-world datasets, such as videos with low-bitrate compression or older films. 3. The image quality in the manuscript is low; please provide a high-quality version.
1. The presentation is clear and intuitive, with well-defined formulations and symbols. 2. The proposed method is efficient and effective for video inverse problems. Experimental settings are clearly explained, and the results validate the effectiveness of the sampling strategy in improving temporal coherence. 3. The ablation study is comprehensive and provides a clear understanding of the role of each component.
Incremental Novelty: The main contribution of this paper lies in addressing the batch consistency problem by applying a 2D pretrained diffusion model. The paper is highly dependent on established techniques for inverse problems, including Tweedie denoising and multi-step conjugate gradient (CG) for frame-dependent perturbation. While the results are impressive, the technical contribution appears relatively incremental.
Taxonomy
Topics: Image and Signal Denoising Methods
Methods: Diffusion
