Unlimited-Size Diffusion Restoration

Yinhuai Wang; Jiwen Yu; Runyi Yu; Jian Zhang

arXiv:2303.00354 · cs.CV · March 2, 2023

TL;DR

This paper introduces a novel, parameter-free approach for zero-shot diffusion-based image restoration and generation of unlimited size images, overcoming fixed-size limitations while maintaining high quality.

Contribution

It proposes Mask-Shift Restoration and Hierarchical Restoration methods to handle arbitrary image sizes, addressing local and global coherence issues in diffusion models.

Findings

1. Effective arbitrary-size image restoration without finetuning

2. Reduces artifacts and out-of-domain issues in large images

3. Applicable to both image restoration and generation tasks

Equations

$$\mathbf{x}_{t}=a_{t}\mathbf{x}_{0}+\sigma_{t}\boldsymbol{\epsilon},\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\tag{1}$$

$$\mathbf{x}_{0|t}=\frac{1}{a_{t}}(\mathbf{x}_{t}-\sigma_{t}\boldsymbol{\epsilon}_{t})\tag{2}$$

$$\boldsymbol{\epsilon}_{t}=\mathcal{Z}_{\boldsymbol{\theta}}(\mathbf{x}_{t},t)\tag{3}$$

$$\mathbf{x}_{t-1}=a_{t-1}\mathbf{x}_{0|t}+\sigma_{t-1}\boldsymbol{\epsilon},\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\tag{4}$$

$$\mathbf{x}_{t-1}=a_{t-1}\mathbf{x}_{0|t}+\sigma_{t-1}(\eta_{t}\boldsymbol{\epsilon}+\sqrt{1-\eta_{t}^{2}}\boldsymbol{\epsilon}_{t}),\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\tag{5}$$

$$\nabla_{\boldsymbol{\theta}}\|\boldsymbol{\epsilon}-\mathcal{Z}_{\boldsymbol{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}.\tag{6}$$

$$\textit{Consistency}:\ \mathbf{A}\hat{\mathbf{x}}\equiv\mathbf{y},\qquad\textit{Realness}:\ \hat{\mathbf{x}}\sim q(\mathbf{x}),\tag{7}$$

$$\hat{\mathbf{x}}=\mathbf{A}^{\dagger}\mathbf{y}+(\mathbf{I}-\mathbf{A}^{\dagger}\mathbf{A})\mathbf{x}_{r}.\tag{8}$$

$$\hat{\mathbf{x}}_{0|t}=\mathbf{A}^{\dagger}\mathbf{y}+(\mathbf{I}-\mathbf{A}^{\dagger}\mathbf{A})\mathbf{x}_{0|t}.\tag{9}$$

$$\mathbf{x}_{t-1}=a_{t-1}\hat{\mathbf{x}}_{0|t}+\sigma_{t-1}(\eta_{t}\boldsymbol{\epsilon}+\sqrt{1-\eta_{t}^{2}}\boldsymbol{\epsilon}_{t}),\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\tag{10}$$

$$\bar{\mathbf{x}}_{0|t}=\mathbf{A}_{m}\dot{\mathbf{x}}_{0}+(\mathbf{I}-\mathbf{A}_{m})\hat{\mathbf{x}}_{0|t}.\tag{11}$$

$$\tilde{\mathbf{x}}_{0|t}=\mathbf{A_{sr}^{\dagger}}\ddot{\mathbf{x}}_{0}+(\mathbf{I}-\mathbf{A_{sr}^{\dagger}}\mathbf{A_{sr}})\mathbf{x}_{0|t}.\tag{12}$$

$$\hat{\mathbf{x}}_{0|t}=\mathbf{x}_{0|t}+\mathbf{\Sigma}_{t}\mathbf{A}^{\dagger}(\mathbf{y}-\mathbf{A}\mathbf{x}_{0|t}),\tag{13}$$

$$\mathbf{x}_{t-1}=a_{t-1}\hat{\mathbf{x}}_{0|t}+\sigma_{t-1}(\boldsymbol{\Phi}_{t}\boldsymbol{\epsilon}+\sqrt{1-\eta_{t}^{2}}\boldsymbol{\epsilon}_{t}),\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\tag{14}$$

$$a_{t-1}\mathbf{\Sigma}_{t}\mathbf{A}^{\dagger}\mathbf{n}+\sigma_{t-1}(\boldsymbol{\Phi}_{t}\boldsymbol{\epsilon}+\sqrt{1-\eta_{t}^{2}}\boldsymbol{\epsilon}_{t})\sim\mathcal{N}(\mathbf{0},\sigma_{t-1}^{2}\mathbf{I})\tag{15}$$

$$a_{t-1}\mathbf{\Sigma}_{t}\mathbf{A}^{\dagger}\mathbf{n}+\sigma_{t-1}\boldsymbol{\Phi}_{t}\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma_{t-1}^{2}\eta_{t}^{2}\mathbf{I})\tag{16}$$

$$a_{t-1}^{2}\sigma_{\mathbf{y}}^{2}\mathbf{\Sigma}_{t}\mathbf{A}^{\dagger}(\mathbf{\Sigma}_{t}\mathbf{A}^{\dagger})^{\top}+\sigma_{t-1}^{2}\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\top}=\sigma_{t-1}^{2}\eta_{t}^{2}\mathbf{I}\tag{17}$$

$$\mathbf{A}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top},\quad\mathbf{A}^{\dagger}=\mathbf{V}\mathbf{\Sigma}^{\dagger}\mathbf{U}^{\top}\tag{18}$$

$$\mathbf{\Sigma}_{t}=\mathbf{V}\boldsymbol{\Lambda}_{t}\mathbf{V}^{\top},\quad\boldsymbol{\Phi}_{t}=\mathbf{V}\boldsymbol{\Gamma}_{t}\mathbf{V}^{\top}\tag{19}$$

$$a_{t-1}^{2}\sigma_{\mathbf{y}}^{2}\mathbf{V}\boldsymbol{\Lambda}_{t}\mathbf{\Sigma}^{\dagger}(\mathbf{\Sigma}^{\dagger})^{\top}\boldsymbol{\Lambda}_{t}\mathbf{V}^{\top}+\sigma_{t-1}^{2}\mathbf{V}\boldsymbol{\Gamma}_{t}^{2}\mathbf{V}^{\top}=\sigma_{t-1}^{2}\eta_{t}^{2}\mathbf{I}\tag{20}$$

$$\mathbf{V}(a_{t-1}^{2}\sigma_{\mathbf{y}}^{2}\boldsymbol{\Lambda}_{t}\mathbf{\Sigma}^{\dagger}(\mathbf{\Sigma}^{\dagger})^{\top}\boldsymbol{\Lambda}_{t}+\sigma_{t-1}^{2}\boldsymbol{\Gamma}_{t}^{2})\mathbf{V}^{\top}=\mathbf{V}\sigma_{t-1}^{2}\eta_{t}^{2}\mathbf{I}\mathbf{V}^{\top}\tag{21}$$

$$a_{t-1}^{2}\sigma_{\mathbf{y}}^{2}\boldsymbol{\Lambda}_{t}\mathbf{\Sigma}^{\dagger}(\mathbf{\Sigma}^{\dagger})^{\top}\boldsymbol{\Lambda}_{t}+\sigma_{t-1}^{2}\boldsymbol{\Gamma}_{t}^{2}=\sigma_{t-1}^{2}\eta_{t}^{2}\mathbf{I}\tag{22}$$

$$\boldsymbol{\Lambda}_{t}=\mathrm{diag}\{\lambda_{t1},\lambda_{t2},\cdots,\lambda_{tD}\}\tag{23}$$

$$\boldsymbol{\Gamma}_{t}=\mathrm{diag}\{\gamma_{t1},\gamma_{t2},\cdots,\gamma_{tD}\}\tag{24}$$

$$\mathbf{\Sigma}^{\dagger}(\mathbf{\Sigma}^{\dagger})^{\top}=\mathrm{diag}\{s_{1}^{2},s_{2}^{2},\cdots,s_{D}^{2}\}\tag{25}$$

$$a_{t-1}^{2}\sigma_{\mathbf{y}}^{2}\lambda_{ti}^{2}s_{i}^{2}+\sigma_{t-1}^{2}\gamma_{ti}^{2}=\sigma_{t-1}^{2}\eta_{t}^{2}\tag{26}$$

$$\gamma_{ti}=\sqrt{\frac{\sigma_{t-1}^{2}\eta_{t}^{2}-a_{t-1}^{2}\sigma_{\mathbf{y}}^{2}\lambda_{ti}^{2}s_{i}^{2}}{\sigma_{t-1}^{2}}}\tag{27}$$

$$\lambda_{ti}=\begin{cases}1,&\sigma_{t-1}\eta_{t}\geq a_{t-1}\sigma_{\mathbf{y}}s_{i},\\ \dfrac{\sigma_{t-1}\eta_{t}}{a_{t-1}\sigma_{\mathbf{y}}s_{i}},&\sigma_{t-1}\eta_{t}<a_{t-1}\sigma_{\mathbf{y}}s_{i}\end{cases}\tag{28}$$

Code & Models

Repositories

wyhuai/ddnm (PyTorch, official)

Taxonomy

Topics: Advanced Image Processing Techniques · Medical Imaging Techniques and Applications · Radiomics and Machine Learning in Medical Imaging

Methods: Diffusion

Full text

Unlimited-Size Diffusion Restoration

Yinhuai Wang¹, Jiwen Yu¹, Runyi Yu¹, Jian Zhang¹,²

¹Peking University, SECE; ²Peng Cheng Laboratory

Abstract

Recently, using diffusion models for zero-shot image restoration (IR) has become a new hot paradigm. This type of method only needs to use the pre-trained off-the-shelf diffusion models, without any finetuning, and can directly handle various IR tasks. The upper limit of the restoration performance depends on the pre-trained diffusion models, which are in rapid evolution. However, current methods only discuss how to deal with fixed-size images, but dealing with images of arbitrary sizes is very important for practical applications. This paper focuses on how to use those diffusion-based zero-shot IR methods to deal with any size while maintaining the excellent characteristics of zero-shot. A simple way to solve arbitrary size is to divide it into fixed-size patches and solve each patch independently. But this may yield significant artifacts since it neither considers the global semantics of all patches nor the local information of adjacent patches. Inspired by the Range-Null space Decomposition, we propose the Mask-Shift Restoration to address local incoherence and propose the Hierarchical Restoration to alleviate out-of-domain issues. Our simple, parameter-free approaches can be used not only for image restoration but also for image generation of unlimited sizes, with the potential to be a general tool for diffusion models. Code: https://github.com/wyhuai/DDNM/tree/main/hq_demo.

1 Introduction

Recent progress in diffusion models [26, 28, 10, 24, 8, 2] has inspired a lot of work on Image Restoration (IR) tasks [33, 4, 27, 13, 12, 17, 25, 6, 7, 5, 23, 21, 34]. These diffusion-based IR methods can be roughly divided into supervised [23, 21, 34, 14] and zero-shot [33, 4, 27, 13, 12, 25, 17, 6, 7, 5] methods. Among them, zero-shot methods have become a popular new paradigm, since they only need a pre-trained off-the-shelf diffusion model and can directly handle various IR tasks without any finetuning. Since zero-shot methods are usually independent of the choice of diffusion model, they can achieve better performance once a more powerful diffusion model is available. In this paper, we focus on zero-shot methods [33, 4, 27, 13, 12, 25, 17, 6, 7, 5], which are concise, flexible, and progressing rapidly.

Existing diffusion-based IR methods mainly focus on IR problems with fixed output sizes. But in real-world applications, the desired output size may be arbitrary, depending on the user’s demands. There are two main difficulties in applying these zero-shot IR methods to arbitrary output sizes: (1) the diffusion models used are usually pre-trained on fixed-size images, and thus face out-of-domain (OOD) issues when extended to arbitrary sizes; (2) the default network structure may not support arbitrary output sizes. The OOD issue can be addressed by training the diffusion models on randomly cropped images, but the network structure constraint is hard to address. A common practice to bypass this constraint is to divide the input image into fixed-size patches, process each patch independently with the network, and then concatenate the resulting patches into the final result, as shown in the middle of Fig. 1. However, this may lead to evident block artifacts and unreasonable restoration, because it considers neither the global semantics of all patches nor the local information of adjacent patches.

We observe that the neighboring correlation is well considered in inpainting tasks in DDNM [33], which inspired us to leave overlapped regions when dividing patches, then take the overlapped region as extra mask constraints when solving the following patches. We name this method Mask-Shift Restoration (MSR), which assures the coherence between patches and effectively eliminates boundary artifacts.

To further alleviate the OOD problem, we propose to first restore the result at a small size, then use the small result as a global prior for the final result. We name this method Hierarchical Restoration (HiR). Note that both MSR and HiR preserve the zero-shot property and can be flexibly combined. The bottom of Fig. 1 shows the result using both MSR and HiR based on DDNM. From the perspective of Range-Null space Decomposition (RND), MSR and HiR essentially add extra linear constraints to the given inverse problem. This property makes them well suited to DDNM, which is built on the principle of RND.

Our contributions include:

1. We propose Mask-Shift Restoration (MSR), a simple but effective method to eliminate boundary artifacts when processing a large image in patches.

2. We propose Hierarchical Restoration (HiR) to alleviate the out-of-domain problem and the lack of global semantics when processing a large image in patches.

3. We provide typical pipelines for using MSR and HiR for diverse applications, including but not limited to image generation, super-resolution, colorization, inpainting, and denoising. It is worth noting that our proposed methods are parameter-free and training-free, and can be applied to diverse diffusion models and zero-shot restoration methods.

2 Preliminaries

2.1 Diffusion Models

Diffusion models have diverse interpretations [28, 24, 2, 16, 15], but in this paper we set aside the mathematical details and introduce the diffusion model in the most concise and general way. Diffusion models [26, 28, 10, 24, 8, 2] define a $T$-step forward process and a $T$-step reverse process. The forward process adds random noise to data, while the reverse process constructs desired data samples from the noise. Specifically, the forward process yields a noisy image $\mathbf{x}_{t}$ from a clean image $\mathbf{x}_{0}$:

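$$\mathbf{x}_{t}=a_{t}\mathbf{x}_{0}+\sigma_{t}\boldsymbol{\epsilon},\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\tag{1}$$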

where $t\sim\{0,...,T\}$, $a_{t}$ and $\sigma_{t}$ are predefined scale factors, and $\mathcal{N}$ represents the Gaussian distribution.

The core of the reverse process is estimating the clean image $\mathbf{x}_{0}$ from the noisy image $\mathbf{x}_{t}$ :

$$\mathbf{x}_{0|t}=\frac{1}{a_{t}}(\mathbf{x}_{t}-\sigma_{t}\boldsymbol{\epsilon}_{t})\tag{2}$$

which is the reverse of Eq. 1, where $\boldsymbol{\epsilon}_{t}$ denotes the estimate of the noise $\boldsymbol{\epsilon}$ and $\mathbf{x}_{0|t}$ denotes the estimate of $\mathbf{x}_{0}$ at time step $t$. Typically, a denoiser $\mathcal{Z}_{\boldsymbol{\theta}}$ is used to yield $\boldsymbol{\epsilon}_{t}$:

$$\boldsymbol{\epsilon}_{t}=\mathcal{Z}_{\boldsymbol{\theta}}(\mathbf{x}_{t},t)\tag{3}$$

Then we can use Eq. 1 to generate the previous state $\mathbf{x}_{t-1}$, with $\mathbf{x}_{0|t}$ as the estimate of $\mathbf{x}_{0}$:

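$$\mathbf{x}_{t-1}=a_{t-1}\mathbf{x}_{0|t}+\sigma_{t-1}\boldsymbol{\epsilon},\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\tag{4}$$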

With the above formulations, one can generate a clean image $\mathbf{x}_{0}$ from random noise $\mathbf{x}_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ by iterating Eq. 2 and Eq. 4 while decreasing $t$ from $T$ to 0.
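As a rough illustration of Eqs. 1, 2, and 4, the sketch below noises a short 1-D signal and then inverts the step. It is not the authors' implementation: plain Python lists stand in for images, the scale factors are arbitrary illustrative values, and the trained denoiser $\mathcal{Z}_{\boldsymbol{\theta}}$ is replaced by an oracle that simply returns the true noise.

```python
import random


def forward_noise(x0, a_t, sigma_t, rng):
    """Eq. 1: x_t = a_t * x_0 + sigma_t * eps, with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [a_t * v + sigma_t * e for v, e in zip(x0, eps)]
    return x_t, eps


def estimate_clean(x_t, eps_t, a_t, sigma_t):
    """Eq. 2: x_{0|t} = (x_t - sigma_t * eps_t) / a_t."""
    return [(v - sigma_t * e) / a_t for v, e in zip(x_t, eps_t)]


def previous_state(x0_t, a_prev, sigma_prev, rng):
    """Eq. 4: x_{t-1} = a_{t-1} * x_{0|t} + sigma_{t-1} * eps."""
    return [a_prev * v + sigma_prev * rng.gauss(0.0, 1.0) for v in x0_t]


rng = random.Random(0)
x0 = [0.2, -0.5, 0.9, 0.0]           # toy "clean image"
a_t, sigma_t = 0.6, 0.8              # illustrative scale factors (assumed)
x_t, eps = forward_noise(x0, a_t, sigma_t, rng)

# Oracle stand-in for the denoiser: pass the true noise as eps_t.
x0_t = estimate_clean(x_t, eps, a_t, sigma_t)
x_prev = previous_state(x0_t, a_prev=0.8, sigma_prev=0.6, rng=rng)
```

With the oracle noise, Eq. 2 recovers the clean signal up to rounding; with a learned denoiser, $\mathbf{x}_{0|t}$ is only an estimate that improves as $t$ decreases.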

Such a reverse process is the simplest form. Further, for Eq. 4, we can interpolate the newly added noise $\boldsymbol{\epsilon}$ with the estimated previous noise $\boldsymbol{\epsilon}_{t}$ under the premise of invariant total variance:

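$$\mathbf{x}_{t-1}=a_{t-1}\mathbf{x}_{0|t}+\sigma_{t-1}(\eta_{t}\boldsymbol{\epsilon}+\sqrt{1-\eta_{t}^{2}}\boldsymbol{\epsilon}_{t}),\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\tag{5}$$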

where $\eta_{t}$ is an interpolation factor that controls the ratio of the newly introduced noise $\boldsymbol{\epsilon}$. Note that Eq. 5 describes a general form of reverse sampling methods. The critical difference between sampling methods is the setting of $\eta_{t}$. For DDIM [24], $\eta_{t}$ is a time-independent scalar; for DDPM [10] and Analytic-DPM [2], $\eta_{t}$ is a time-dependent function.
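The interpolation in Eq. 5 can be sketched as follows. This is a minimal illustration on 1-D lists with made-up values, not code from the paper; the function name `reverse_step` is hypothetical. The weights $\eta_{t}$ and $\sqrt{1-\eta_{t}^{2}}$ have squares that sum to one, which is what keeps the total noise variance at $\sigma_{t-1}^{2}$ when the two noise terms are treated as independent unit Gaussians.

```python
import math
import random


def reverse_step(x0_t, eps_t, a_prev, sigma_prev, eta, rng):
    """Eq. 5: mix fresh noise eps with the estimated noise eps_t."""
    keep = math.sqrt(1.0 - eta ** 2)     # weight on the estimated noise
    out = []
    for v, e_t in zip(x0_t, eps_t):
        fresh = rng.gauss(0.0, 1.0)      # newly introduced noise
        out.append(a_prev * v + sigma_prev * (eta * fresh + keep * e_t))
    return out


rng = random.Random(0)
x0_t = [0.1, 0.4, -0.3]
eps_t = [0.5, -1.0, 0.25]

# eta = 0 uses no fresh noise, so the step is deterministic.
deterministic = reverse_step(x0_t, eps_t, 0.8, 0.6, eta=0.0, rng=rng)
# eta = 1 uses only fresh noise, which reduces Eq. 5 to Eq. 4.
stochastic = reverse_step(x0_t, eps_t, 0.8, 0.6, eta=1.0, rng=rng)
```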

To train the denoiser $\mathcal{Z}_{\boldsymbol{\theta}}$, one can randomly pick a clean image $\mathbf{x}_{0}$ from the dataset and a random time step $t$ to yield a noisy image $\mathbf{x}_{t}$ using Eq. 1. One then updates the network parameters $\boldsymbol{\theta}$ with a gradient descent step using the following gradient [10], and repeats the whole process until convergence.

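$$\nabla_{\boldsymbol{\theta}}\|\boldsymbol{\epsilon}-\mathcal{Z}_{\boldsymbol{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}.\tag{6}$$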

2.2 Denoising Diffusion Null-space Model (DDNM)

Recent progress shows that pre-trained diffusion models can be used to solve linear inverse problems in a zero-shot manner [33, 17, 4, 27, 12], without extra training or optimization. DDNM [33] explains the nature of such methods.

DDNM starts with noise-free linear image inverse problems. Given a degraded image $\mathbf{y}=\mathbf{A}\mathbf{x}$ where $\mathbf{A}$ is a linear operator and $\mathbf{x}$ is the original image, image restoration aims at yielding a result $\hat{\mathbf{x}}$ that satisfies two constraints:

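$$\textit{Consistency}:\ \mathbf{A}\hat{\mathbf{x}}\equiv\mathbf{y},\qquad\textit{Realness}:\ \hat{\mathbf{x}}\sim q(\mathbf{x}),\tag{7}$$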

where $q(\mathbf{x})$ denotes the distribution of the ground-truth (GT) images.

Such a problem has a general solution that analytically satisfies the Consistency constraint:

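$$\hat{\mathbf{x}}=\mathbf{A}^{\dagger}\mathbf{y}+(\mathbf{I}-\mathbf{A}^{\dagger}\mathbf{A})\mathbf{x}_{r}.\tag{8}$$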

where $\mathbf{A^{\dagger}}$ is the pseudo-inverse of $\mathbf{A}$ (satisfying $\mathbf{A}\mathbf{A^{\dagger}}\mathbf{A}\equiv\mathbf{A}$), and $\mathbf{x}_{r}$ is the unknown null-space variable to be solved. Note that Eq. 8 originates from the Range-Null space Decomposition [33, 31, 3]. Another interpretation is that $\mathbf{A^{\dagger}}\mathbf{y}$ can be seen as a special solution of $\mathbf{A}\mathbf{x}=\mathbf{y}$, since $\mathbf{A}\mathbf{A^{\dagger}}\mathbf{y}\equiv\mathbf{A}\mathbf{A^{\dagger}}\mathbf{A}\mathbf{x}\equiv\mathbf{A}\mathbf{x}\equiv\mathbf{y}$; and $(\mathbf{I}-\mathbf{A^{\dagger}}\mathbf{A})\mathbf{x}_{r}$ can be seen as a general solution of $\mathbf{A}\mathbf{x}=\mathbf{0}$, since $\mathbf{A}(\mathbf{I}-\mathbf{A^{\dagger}}\mathbf{A})\mathbf{x}_{r}\equiv(\mathbf{A}-\mathbf{A}\mathbf{A^{\dagger}}\mathbf{A})\mathbf{x}_{r}\equiv(\mathbf{A}-\mathbf{A})\mathbf{x}_{r}\equiv\mathbf{0}$ holds whatever $\mathbf{x}_{r}$ is.
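The consistency property of Eq. 8 can be checked numerically. The sketch below assumes a 1-D analogue of the operators used later in the paper: $\mathbf{A}$ is an average-pooling downsampler and $\mathbf{A}^{\dagger}$ a replication upsampler. The function names and the toy values are illustrative only.

```python
def avg_pool(x, k):
    """A: average-pooling downsampler over non-overlapping windows of length k."""
    return [sum(x[i:i + k]) / k for i in range(0, len(x), k)]


def replicate(y, k):
    """A^dagger: replication upsampler, repeating each sample k times."""
    return [v for v in y for _ in range(k)]


def range_null_compose(y, x_r, k):
    """Eq. 8: x_hat = A^dagger y + (I - A^dagger A) x_r."""
    range_part = replicate(y, k)
    low = replicate(avg_pool(x_r, k), k)            # A^dagger A x_r
    null_part = [v - p for v, p in zip(x_r, low)]   # (I - A^dagger A) x_r
    return [r + n for r, n in zip(range_part, null_part)]


k = 2
x = [1.0, 3.0, 2.0, 6.0]          # toy ground-truth signal
y = avg_pool(x, k)                # degraded observation: [2.0, 4.0]
x_r = [4.0, -1.0, 0.5, 3.5]       # arbitrary null-space variable
x_hat = range_null_compose(y, x_r, k)
```

Whatever `x_r` is, downsampling `x_hat` gives back `y`; only the within-window detail of `x_r` survives, which is the part a diffusion prior is asked to fill in.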

To conclude, Eq. 8 defines a solution that analytically satisfies the Consistency constraint, but a proper null-space variable $\mathbf{x}_{r}$ must still be found to meet the Realness constraint. As we will see later, the methods proposed in this paper rely heavily on Eq. 8.

In DDNM [33], the critical step using diffusion models for inverse problems is taking each estimation $\mathbf{x}_{0|t}$ as the null-space variable $\mathbf{x}_{r}$ in Eq. 8:

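$$\hat{\mathbf{x}}_{0|t}=\mathbf{A}^{\dagger}\mathbf{y}+(\mathbf{I}-\mathbf{A}^{\dagger}\mathbf{A})\mathbf{x}_{0|t}.\tag{9}$$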

and then using this consistent result $\hat{\mathbf{x}}_{0|t}$ for subsequent sampling:

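$$\mathbf{x}_{t-1}=a_{t-1}\hat{\mathbf{x}}_{0|t}+\sigma_{t-1}(\eta_{t}\boldsymbol{\epsilon}+\sqrt{1-\eta_{t}^{2}}\boldsymbol{\epsilon}_{t}),\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\tag{10}$$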

Algo. 1 shows the whole process of DDNM. See Appendix A for DDNM in the noisy setting.
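Since the algorithm listing is not reproduced here, the following is a hedged sketch of how Eqs. 2, 3, 9, and 10 might be combined into a sampling loop. It is not the authors' Algo. 1: it works on 1-D lists, the noise schedule is a hypothetical one chosen so that $a_{0}=1$ and $\sigma_{0}=0$, and `placeholder_denoiser` is a stand-in that returns zeros where a trained $\mathcal{Z}_{\boldsymbol{\theta}}$ would predict noise.

```python
import math
import random


def avg_pool(x, k):
    """A: average-pooling downsampler (window and stride k)."""
    return [sum(x[i:i + k]) / k for i in range(0, len(x), k)]


def replicate(y, k):
    """A^dagger: replication upsampler."""
    return [v for v in y for _ in range(k)]


def ddnm_sample(y, k, denoiser, steps, eta, rng):
    """Noise-free DDNM-style sampling for 1-D super-resolution."""
    # Hypothetical schedule: a_t**2 + sigma_t**2 == 1, a_0 == 1, sigma_0 == 0.
    sigma = [math.sin(0.45 * math.pi * t / steps) for t in range(steps + 1)]
    a = [math.sqrt(1.0 - s * s) for s in sigma]
    keep = math.sqrt(1.0 - eta ** 2)
    up_y = replicate(y, k)                              # A^dagger y
    x_t = [rng.gauss(0.0, 1.0) for _ in up_y]           # start from noise
    for t in range(steps, 0, -1):
        eps_t = denoiser(x_t, t)                        # Eq. 3
        x0_t = [(v - sigma[t] * e) / a[t]
                for v, e in zip(x_t, eps_t)]            # Eq. 2
        low = replicate(avg_pool(x0_t, k), k)           # A^dagger A x_{0|t}
        x0_hat = [u + v - p
                  for u, v, p in zip(up_y, x0_t, low)]  # Eq. 9
        x_t = [a[t - 1] * v
               + sigma[t - 1] * (eta * rng.gauss(0.0, 1.0) + keep * e)
               for v, e in zip(x0_hat, eps_t)]          # Eq. 10
    return x_t


def placeholder_denoiser(x_t, t):
    """Stand-in for Z_theta; a real model would predict the noise in x_t."""
    return [0.0] * len(x_t)


y = [2.0, 4.0, -1.0]
x0 = ddnm_sample(y, k=2, denoiser=placeholder_denoiser, steps=20,
                 eta=0.85, rng=random.Random(0))
```

Even with the placeholder denoiser the output downsamples exactly to `y`, because Eq. 9 enforces consistency at every step; the realism of the result depends entirely on the denoiser.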

3 Method

We have introduced the basic principles of the diffusion model and DDNM. We can see that the limitation on the image processing size lies in the denoiser. Usually, the denoiser is pre-trained on fixed-size images. How do we use such pre-trained denoisers for unlimited-size image restoration? In the following, we propose two methods to achieve this goal, both of which inherit the zero-shot property.

3.1 Process as a Whole Image

Typical diffusion models [26, 28, 10, 24, 8, 2] use U-Net structures [20] as the denoiser backbone. Theoretically, U-Net is a convolutional network and thus supports scalable input size.

Hence a simple solution is to directly change the model processing size. A similar approach has been widely adopted by Stable Diffusion [19] for flexible generated sizes. Despite supporting flexible input sizes, a denoiser trained on a fixed image size may face an out-of-domain (OOD) problem when applied to other image sizes. As shown in Fig. 2, a diffusion model trained on CelebA 256 $\times$ 256 fails to generate the desired 512 $\times$ 512 face images. One way to solve the OOD issue is to train the 256 $\times$ 256 denoiser on a randomly cropped dataset, rather than an aligned one. Interestingly, ImageNet and LAION-5B happen to be non-aligned datasets, and hence suffer relatively minor OOD issues.

3.2 Process as Patches

Directly changing the model processing size may work, but it has the following limitations: (1) it may yield bad results when facing OOD problems, as shown in Fig. 2(b); (2) it still constrains the image size, e.g., to be divisible by 32; (3) large sizes, e.g., 1024 $\times$ 1024, may cause unaffordable memory consumption; (4) classifier guidance [8] cannot be applied, since it is usually designed for fixed input sizes; (5) other potential network backbones [18] may not support a flexible processing size.

How can we use diffusion models with fixed processing sizes to handle arbitrary image sizes? A simple solution is to divide the input image $\mathbf{y}$ into patches, solve each patch independently, and then concatenate the results. But this may cause evident boundary artifacts, as shown in the middle of Fig. 1. This is because each patch is solved independently and the connection between patches is not considered.

3.3 Mask-Shift Restoration

Among the many image restoration tasks, inpainting is the typical one that considers the connection between the masked and unmasked region. Zero-shot methods like DDNM [33] and RePaint [17] show good performance in solving inpainting.

Our insight is that we can leave overlapped regions when dividing patches, then take these overlapped regions as an extra constraint when solving the following patches. The neat thing is that this constraint can be integrated into existing zero-shot methods [33, 4, 27, 13, 12, 25, 17, 6, 7, 5], with just one extra line of code!

Let’s take a 4 $\times$ SR task as an example, as shown in Fig. 3. Given an input image $\mathbf{y}^{full}$ of size 64 $\times$ 96, our aim is to get an SR result of size 256 $\times$ 384. Here we set the degradation operator $\mathbf{A}$ as the average-pooling downsampler, and its pseudo-inverse $\mathbf{A}^{\dagger}$ as the replication upsampler [31]. Fig. 3(a) shows the result of $\mathbf{A}^{\dagger}\mathbf{y}^{full}$. We first divide $\mathbf{A}^{\dagger}\mathbf{y}^{full}$ into two square patches $\mathbf{A}^{\dagger}\dot{\mathbf{y}}$ and $\mathbf{A}^{\dagger}\mathbf{y}$ of size 256 $\times$ 256. Note that $\mathbf{A}^{\dagger}\dot{\mathbf{y}}$ and $\mathbf{A}^{\dagger}\mathbf{y}$ have an overlap of size 256 $\times$ 128.

We first use the default DDNM to process $\mathbf{A}^{\dagger}\dot{\mathbf{y}}$ and get the SR result $\dot{\mathbf{x}}_{0}$ (Step 1 in Fig. 3). The 256 $\times$ 128 region where $\mathbf{A}^{\dagger}\dot{\mathbf{y}}$ and $\mathbf{A}^{\dagger}\mathbf{y}$ overlap is already restored in $\dot{\mathbf{x}}_{0}$. So when we use DDNM to solve $\mathbf{A}^{\dagger}\mathbf{y}$, we can treat the restored overlapped region as the known part of an inpainting problem (Step 2 in Fig. 3). Specifically, we insert an extra inpainting constraint after Eq. 9 in DDNM:

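$$\bar{\mathbf{x}}_{0|t}=\mathbf{A}_{m}\dot{\mathbf{x}}_{0}+(\mathbf{I}-\mathbf{A}_{m})\hat{\mathbf{x}}_{0|t}.\tag{11}$$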

where $\mathbf{A}_{m}$ denotes the mask operator for the overlapped region between $\mathbf{A}^{\dagger}\dot{\mathbf{y}}$ and $\mathbf{A}^{\dagger}\mathbf{y}$. The whole algorithm, named Mask-Shift Restoration (MSR), is summarized in Algo. 2.

As we can see from Fig. 3(c), the final result concatenated from the results of Step 1 and Step 2 does not show boundary artifacts. Similarly, we can iteratively use MSR to generate an unlimited-size image without boundary artifacts. Note that the overlapped region and the shift direction can be arbitrary, and the supported tasks are not limited to SR but include all linear inverse problems.
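The mask constraint of Eq. 11 amounts to overwriting the overlapped part of the current estimate with the already restored values. The sketch below shows this for a left-to-right shift on 1-D lists. The function name, and the explicit alignment of the previous patch's trailing samples with the current patch's leading samples, are assumptions made for illustration; Eq. 11 leaves that alignment implicit in $\mathbf{A}_{m}$.

```python
def mask_shift_blend(prev_result, x0_hat, overlap):
    """Eq. 11 for a left-to-right shift on a 1-D patch.

    prev_result: restored previous patch (x_dot_0 in the text).
    x0_hat:      consistent estimate of the current patch from Eq. 9.
    overlap:     number of leading samples of the current patch that
                 coincide with the trailing samples of the previous one.
    """
    # A_m x_dot_0: known values, aligned to the current patch.
    known = prev_result[len(prev_result) - overlap:]
    # (I - A_m) x_hat_{0|t}: keep the estimate outside the overlap.
    return known + x0_hat[overlap:]


prev_result = [0.1, 0.2, 0.3, 0.4]   # restored patch covering positions 0..3
x0_hat = [9.0, 9.0, 0.5, 0.6]        # current estimate covering positions 2..5
x0_bar = mask_shift_blend(prev_result, x0_hat, overlap=2)
```

Applying this blend at every sampling step, rather than once at the end, is what lets the denoiser adapt the new region to the fixed overlap.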

3.4 Hierarchical Restoration

Though MSR assures local coherence, it has a small receptive field when dealing with a large image. This may lead to a poor grasp of global information and hence poor recovery of semantic structure. In Fig. 4(a) we show a masked image of size 512 $\times$ 768, where no 256 $\times$ 256 patch can cover the whole semantic subject. Fig. 4(b) shows the result using MSR based on DDNM. Despite good local coherence, it yields unreasonable semantic structures.

To extend the receptive field for better semantic restoration, we propose Hierarchical Restoration (HiR). HiR consists of two phases: a semantic restoration phase and a texture restoration phase.

Take Fig. 4(a) as an example. For the semantic restoration phase, we first downsample the 512 $\times$ 768 input by 2 $\times$ to obtain a 256 $\times$ 384 input, where a 256 $\times$ 256 patch can cover the whole semantic subject. Then we use MSR based on DDNM to get a 256 $\times$ 384 inpainting result $\ddot{\mathbf{x}}_{0}$, as shown in Fig. 5(a). This result is semantically reasonable and can be used as a low-frequency reference. For the texture restoration phase (Fig. 5(b)), we add an extra low-frequency constraint before Eq. 9:

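$$\tilde{\mathbf{x}}_{0|t}=\mathbf{A_{sr}^{\dagger}}\ddot{\mathbf{x}}_{0}+(\mathbf{I}-\mathbf{A_{sr}^{\dagger}}\mathbf{A_{sr}})\mathbf{x}_{0|t}.\tag{12}$$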

where $\mathbf{A_{sr}}$ and $\mathbf{A_{sr}^{\dagger}}$ represent the average-pooling downsampler and its pseudo-inverse upsampler [31], respectively. Algo. 3 shows the whole algorithm of the second phase of HiR.
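The low-frequency constraint of Eq. 12 can be sketched in the same 1-D setting: the window averages of the current estimate are swapped for those of the coarse phase-one result, while the within-window detail is kept. Function names and numbers are illustrative, and a 2 $\times$ average-pooling operator is assumed for $\mathbf{A_{sr}}$.

```python
def avg_pool(x, k):
    """A_sr: average-pooling downsampler (window and stride k)."""
    return [sum(x[i:i + k]) / k for i in range(0, len(x), k)]


def replicate(y, k):
    """A_sr^dagger: replication upsampler."""
    return [v for v in y for _ in range(k)]


def low_frequency_constraint(x0_t, coarse, k):
    """Eq. 12: x_tilde = A_sr^dagger x_ddot_0 + (I - A_sr^dagger A_sr) x_{0|t}."""
    own_low = replicate(avg_pool(x0_t, k), k)   # A_sr^dagger A_sr x_{0|t}
    ref_low = replicate(coarse, k)              # A_sr^dagger x_ddot_0
    return [r + v - o for r, v, o in zip(ref_low, x0_t, own_low)]


coarse = [1.0, 3.0]                 # phase-one result at half resolution
x0_t = [0.0, 4.0, 1.0, 2.0]         # current full-resolution estimate
x0_tilde = low_frequency_constraint(x0_t, coarse, k=2)
```

Downsampling `x0_tilde` reproduces `coarse`, so the global structure is fixed by the first phase and the second phase only contributes texture.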

As we can see from Fig. 4(d), the use of HiR significantly improves semantic correctness. Note that the HiR is not limited to inpainting tasks, but is also useful for large-scale SR (Fig. 1(c)) and colorization (Fig. 7), etc.

3.5 Flexible Pipeline for Applications

Mask-Shift Restoration (MSR) can be seen as a general patch connection technique, and Hierarchical Restoration (HiR) can be seen as a general method to improve restoration quality. The essence of both MSR and HiR is to fix part of the information via prior knowledge in order to narrow the solution space. In this paper, we implement MSR and HiR via the Range-Null space Decomposition, which is concise, effective, and mathematically elegant. Besides, there remain other possible ways to implement MSR and HiR, e.g., adding an extra loss to optimization-based methods such as DPS [5]. Hence the proposed MSR and HiR can also be used with other diffusion-based zero-shot IR methods, e.g., ILVR [4], RePaint [17], and DPS [5].

4 Experiment

In this section, we describe the configuration of the experiment in detail. All experiments use the denoiser pre-trained on ImageNet 256 $\times$ 256, provided by guided-diffusion [8]. We use the classifier guidance [8] for sampling. Besides, the time-travel sampling [33, 17] is also used to improve the generative quality.

Given a desired result size, we divide it into patches from left to right, top to bottom. Each patch has a size 256 $\times$ 256 and has overlaps of 128 pixels with its neighbor patch, except for the boundary case. We solve the first patch using the original DDNM and solve the following patches in sequence (left to right, top to bottom) using MSR based on DDNM. Fig. 3 shows the results on 4 $\times$ SR, with $T=100$ , time-travel length [33] $l=10$ , repeat times $r=3$ . In Fig. 6, we present qualitative comparisons between BSRGAN [35] and MSR-based DDNM. We experiment on 4 $\times$ SR and noisy 4 $\times$ SR of different sizes, where MSR-based DDNM uses $T=250$ , $l=10$ , and $r=3$ . For Fig. 1(c), Fig. 4(d), and Fig. 7 we use HiR based on DDNM.
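The patch ordering described above can be sketched as a small helper that enumerates patch corners. How the boundary case is handled is not specified in the text; here it is assumed that the last patch along each axis is shifted back so that it ends at the image border, which simply enlarges its overlap. The function names are hypothetical.

```python
def patch_starts(length, patch=256, stride=128):
    """Start offsets along one axis; the last patch is shifted back to fit."""
    if length < patch:
        raise ValueError("length must be at least the patch size")
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:
        # Assumed boundary handling: align the final patch with the border.
        starts.append(length - patch)
    return starts


def patch_grid(height, width, patch=256, stride=128):
    """(top, left) corners ordered left to right, then top to bottom."""
    return [(top, left)
            for top in patch_starts(height, patch, stride)
            for left in patch_starts(width, patch, stride)]


grid = patch_grid(256, 384)   # the 256 x 384 SR example from Sec. 3.3
```

For the 256 $\times$ 384 example this yields two patches with a 128-pixel horizontal overlap; the first would be solved with plain DDNM and the second with MSR.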

5 Related Work

Range-Null space Decomposition (RND) [29] is a concept in linear algebra. When applied to linear inverse problems, RND explicitly defines the upper limit of recoverable information. Chen et al. [3] introduce RND into image inverse problems, and propose learning the range and null space respectively. Wang et al. [31] propose using GAN Prior to learn the Null-space and propose using average-pooling and its pseudo-inverse as a general tool for SR tasks. In DDNM [33], the authors propose using diffusion sampling to learn the Null-space and propose several practical operators for diverse applications.

Diffusion-based Zero-Shot Image Restoration Methods can be roughly divided into RND-based [33, 4, 27, 13, 12, 25, 17] and optimization-based [6, 7, 5]. The essence of both branches lies in modifying only the sampling process while keeping the network unchanged. Specifically, they modify the intermediate image $\mathbf{x}_{0|t}$ or its noisy version $\mathbf{x}_{t}$. For a given input and a certain degradation operator, RND-based methods use RND to explicitly assure the data consistency of $\mathbf{x}_{0|t}$ or $\mathbf{x}_{t}$, while optimization-based methods optimize $\mathbf{x}_{0|t}$ or $\mathbf{x}_{t}$ toward data consistency. Generally speaking, RND-based methods perform better on linear inverse problems but cannot solve non-linear problems. Optimization-based methods cost more memory and inference time but can support any differentiable operator, even a complex network [1].

6 Limitations & Discussions

Zero-shot IR methods [33, 4, 27, 13, 12, 25, 17, 6, 7, 5] using diffusion models open up a promising new direction for IR problems. The method proposed in this paper further enables those methods to support unlimited image size. However, some limitations remain. Firstly, the computation and time costs are significantly higher than those of prevailing supervised methods. Secondly, the performance ceiling depends on the pre-trained diffusion models. Applying our method to models like Imagen [22] may yield more interesting applications, but they are not open-sourced yet. On the other hand, widely used models like Stable Diffusion [19] operate in a latent space, which makes it difficult to apply zero-shot methods. Thirdly, the degradation operator must be known explicitly, which makes tasks like rain and haze removal difficult.

Another interesting observation is that MSR can be seen as a general image connection method, where we can use different models to restore special crops, e.g., use face restoration models [9, 30, 31, 32, 11] for face crops, then fuse them with the background using MSR to avoid boundary artifacts.

Appendix A DDNM for Noisy Image Restoration

For noisy inverse problems of the form $\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{n}$ , $\mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma_{\mathbf{y}}^{2}\mathbf{I})$ , DDNM uses the denoiser $\mathcal{Z}_{\boldsymbol{\theta}}$ to eliminate the external noise $\mathbf{n}$ . To this end, DDNM introduces two extra coefficients ${\color[rgb]{0,0,1}\mathbf{\Sigma}_{t}}$ and ${\color[rgb]{0,0,1}\boldsymbol{\Phi}_{t}}$ , and turns Eq. 9 and Eq. 10 into

[TABLE]
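To see why the external noise needs separate treatment, the sketch below applies the plain, noise-free range-null space step to a noisy measurement. It does not implement the modified equations with ${\color[rgb]{0,0,1}\mathbf{\Sigma}_{t}}$ and ${\color[rgb]{0,0,1}\boldsymbol{\Phi}_{t}}$ ; the operator, the sizes, and the noise level `sigma_y` are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_dim, m_dim, sigma_y = 16, 4, 0.1

# Hypothetical linear operator with full row rank and its pseudo-inverse.
A = rng.standard_normal((m_dim, n_dim))
A_pinv = np.linalg.pinv(A)

x_true = rng.standard_normal(n_dim)
noise = sigma_y * rng.standard_normal(m_dim)
y_noisy = A @ x_true + noise        # y = A x + n

# Plain (noise-free) consistency step applied to a stand-in estimate.
x0t = rng.standard_normal(n_dim)
x_hat = A_pinv @ y_noisy + x0t - A_pinv @ (A @ x0t)

# The measurement noise is copied into the range space of the estimate.
leaked = A @ x_hat - A @ x_true
print(np.allclose(leaked, noise))
```

Because the unmodified step forces $\mathbf{A}\hat{\mathbf{x}}_{0|t}=\mathbf{y}$ exactly, the whole of $\mathbf{n}$ is carried into the estimate, which is the behavior the two extra coefficients are meant to control.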

The total noise distribution in $\mathbf{x}_{t-1}$ should be $\mathcal{N}(0,\sigma_{t-1}^{2}\mathbf{I})$ so that it can be removed by the denoiser $\mathcal{Z}_{\boldsymbol{\theta}}$ :

[TABLE]

Considering the variance equivalence:

[TABLE]

As shown in Eq. 17, the coefficients ${\color[rgb]{0,0,1}\mathbf{\Sigma}_{t}}$ and ${\color[rgb]{0,0,1}\boldsymbol{\Phi}_{t}}$ are linearly coupled, which makes them difficult to solve for directly. We therefore use the SVD to transform them into an orthogonal basis. The SVDs of $\mathbf{A}$ and $\mathbf{A}^{\dagger}$ are:

[TABLE]
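The relation between the SVD of $\mathbf{A}$ and that of $\mathbf{A}^{\dagger}$ can be checked numerically. The following is a small NumPy sketch, assuming the usual convention $\mathbf{A}=\mathbf{U}\,\mathrm{diag}(s)\,\mathbf{V}^{\top}$ and $\mathbf{A}^{\dagger}=\mathbf{V}\,\mathrm{diag}(1/s)\,\mathbf{U}^{\top}$ on the non-zero singular values; the helper name `pinv_from_svd` and the random test matrix are hypothetical.

```python
import numpy as np

def pinv_from_svd(A, tol=1e-12):
    """Build the pseudo-inverse from the reduced SVD of A,
    inverting only singular values larger than tol."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    mask = s > tol
    s_inv[mask] = 1.0 / s[mask]
    return Vt.T @ np.diag(s_inv) @ U.T

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 16))    # hypothetical operator

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt     # A = U diag(s) V^T
A_pinv_svd = pinv_from_svd(A)

print(np.allclose(A_rebuilt, A))
print(np.allclose(A_pinv_svd, np.linalg.pinv(A)))
```

Both matrices share the same singular vectors, so an expression built from them becomes diagonal once it is written in that basis.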

At the same time, we construct a special SVD for ${\color[rgb]{0,0,1}\mathbf{\Sigma}_{t}}$ and ${\color[rgb]{0,0,1}\boldsymbol{\Phi}_{t}}$ to further simplify Eq. 17:

[TABLE]

Then Eq. 17 becomes

[TABLE]

The following matrices in Eq. 22 are diagonal:

[TABLE]

So Eq. 22 reduces to an equation on the diagonal elements:

[TABLE]

To make sure Eq. 26 holds, we set

[TABLE]

To preserve the range-space information, we need $\mathbf{\Sigma}_{t}$ to be as close to $\mathbf{I}$ as possible. So we set

[TABLE]

In this way, we obtain the coefficients ${\color[rgb]{0,0,1}\mathbf{\Sigma}_{t}}$ and ${\color[rgb]{0,0,1}\boldsymbol{\Phi}_{t}}$ , with which DDNM can solve noisy inverse problems.

Note that the noise part in Eq. 14 can also be written as $\sigma_{t-1}{\color[rgb]{0,0,1}\boldsymbol{\Phi}_{t}}(\boldsymbol{\epsilon}+\sqrt{1-\eta_{t}^{2}}\boldsymbol{\epsilon}_{t})$ or $\sigma_{t-1}(\boldsymbol{\epsilon}+{\color[rgb]{0,0,1}\boldsymbol{\Phi}_{t}}\sqrt{1-\eta_{t}^{2}}\boldsymbol{\epsilon}_{t})$ ; in that case, the calculation of ${\color[rgb]{0,0,1}\mathbf{\Sigma}_{t}}$ and ${\color[rgb]{0,0,1}\boldsymbol{\Phi}_{t}}$ will be different.

Bibliography


[1] Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. arXiv preprint arXiv:2302.07121, 2023.
[2] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. In International Conference on Learning Representations (ICLR), 2022.
[3] Dongdong Chen and Mike E Davies. Deep decomposition learning for inverse imaging problems. In European Conference on Computer Vision (ECCV). Springer, 2020.
[4] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. ILVR: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[5] Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In International Conference on Learning Representations (ICLR), 2023.
[6] Hyungjin Chung, Byeongsu Sim, and Jong Chul Ye. Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[7] Hyungjin Chung, Byeongsu Sim, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.