WhiteNNer-Blind Image Denoising via Noise Whiteness Priors
Saeed Izadi, Zahra Mirikharaji, Mengliu Zhao, and Ghassan Hamarneh

TL;DR
This paper introduces a neural network model that leverages noise whiteness priors and image regularization techniques to effectively denoise medical images without requiring ground truth data, outperforming existing methods.
Contribution
The novel approach combines noise whiteness priors with traditional image priors in a neural network framework for blind image denoising without ground truth.
Findings
- Outperforms Noise2Noise and Noise2Self in PSNR and SSIM
- Effective on confocal laser endomicroscopy datasets
- No ground truth data needed for training
| Method | | | | CLE100 PSNR | CLE100 SSIM | CLE200 PSNR | CLE200 SSIM | CLE1000 PSNR | CLE1000 SSIM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLM [5] | ✓ | ✕ | ✕ | 23.49 | 0.3236 | 22.87 | 0.3318 | 23.40 | 0.3383 |
| BM3D [7] | ✓ | ✕ | ✕ | 25.91 | 0.4152 | 24.74 | 0.4297 | 25.58 | 0.3877 |
| N2S [2] | ✕ | ✕ | ✕ | 26.75 ± 0.533 | 0.4467 | 25.12 ± 0.521 | 0.4817 | 25.99 ± 0.450 | 0.3989 |
| N2N [17] | ✕ | ✕ | ✓ | 28.04 ± 0.396 | 0.5070 | 26.00 ± 0.297 | 0.5699 | 27.97 ± 0.149 | 0.5139 |
| N2T | ✕ | ✓ | ✕ | 28.31 ± 0.268 | 0.5128 | 26.79 ± 0.495 | 0.5790 | 28.29 ± 0.362 | 0.5315 |
| WhiteNNer-1 | ✕ | ✕ | ✕ | 25.75 ± 0.275 | 0.4171 | 24.62 ± 0.127 | 0.4627 | 25.63 ± 0.401 | 0.3934 |
| WhiteNNer-2 | ✕ | ✕ | ✕ | 27.05 ± 0.184 | 0.4895 | 25.79 ± 0.275 | 0.5301 | 26.58 ± 0.245 | 0.4694 |
| Loss | PSNR |
| --- | --- |
| | 22.41 |
| | 23.14 |
| | 25.88 |
| | 26.02 |
| | 26.12 |
School of Computing Science, Simon Fraser University, Canada
{saeedi, zmirikha, mengliuz, hamarneh}@sfu.ca
Abstract
The accuracy of medical imaging-based diagnostics is directly impacted by the quality of the collected images. A passive approach to improve image quality is one that lags behind improvements in imaging hardware, awaiting better sensor technology of acquisition devices. An alternative, active strategy is to utilize prior knowledge of the imaging system to directly post-process and improve the acquired images. Traditionally, priors about the image properties are taken into account to restrict the solution space. However, few techniques exploit the prior about the noise properties. In this paper, we propose a neural network-based model for disentangling the signal and noise components of an input noisy image, without the need for any ground truth training data. We design a unified loss function that encodes priors about signal as well as noise estimate in the form of regularization terms. Specifically, by using total variation and piecewise constancy priors along with noise whiteness priors such as auto-correlation and stationary losses, our network learns to decouple an input noisy image into the underlying signal and noise components. We compare our proposed method to Noise2Noise and Noise2Self, as well as non-local mean and BM3D, on three public confocal laser endomicroscopy datasets. Experimental results demonstrate the superiority of our network compared to state-of-the-art in terms of PSNR and SSIM.
1 Introduction
Globally, colorectal cancer is the third and the second most commonly diagnosed cancer in men and women, respectively. In 2018, 575,789 men and 520,812 women had a history of colorectal cancer, resulting in an estimated 551,269 deaths [3]. Timely inspection of suspicious areas within the gut, followed by a precise diagnosis, is critical for improved disease prognosis and reduced mortality. However, conventional tissue sectioning and ex-vivo histological examination are associated with invasive biopsy collection and preparation that significantly delay the screening procedure.
Handheld, portable confocal laser endomicroscopy (CLE) is a well-established imaging technique that provides a real-time, in-vivo, and biopsy-free histological assessment (so-called optical biopsy) of the mucosal layer of the gastrointestinal (GI) tract based on both endoscopic and endomicroscopic images acquired during an ongoing endoscopy. CLE provides magnified visualization of tissues in cellular and subcellular resolution and enables the endoscopists to assess the pathology in gastrointestinal tissue sites, e.g., Barrett’s esophagus [15] or colonic mucosa [14].
As the accuracy of medical imaging-based diagnostics depends on the quality of the images, it is critical to curtail imaging noise. One approach for noise reduction is to rely on higher quality image acquisition, be that through longer acquisition times, more elaborate optics, high fidelity electronics, or more complex image reconstruction algorithms. However, all these strategies are not without their drawbacks, i.e., subjecting the patient to lengthier procedures, less portable and more invasive imaging devices, increased cost, or lower frame rates, respectively. An alternative strategy to enhance the image quality is to directly post-process the corrupted images.
Due to the ill-posed nature of the image denoising problem, a large body of classic methods relies on regularization: additional domain-specific prior knowledge is incorporated into the model to restrict the candidates in the optimization search space by penalizing solutions deemed undesirable [23, 7, 8, 5]. Although priors about the properties of the true signal are widely leveraged, available information about the noise is often ignored entirely. Yet the estimate of the clean image can be further improved by also ensuring that the noise estimate conforms to the known and expected properties of the true imaging noise.
Recently, convolutional neural networks (CNNs) have shown remarkable performance for image restoration, particularly image denoising [29, 30, 27]. These CNN-based approaches are typically trained in a supervised manner, requiring pairs of noisy and clean images (ground truth). However, in many applications, it is practically impossible to acquire the ground truth images, e.g., medical imaging of real patients. Lehtinen et al. [17] proposed the Noise2Noise training scheme to relax the requirement of providing ground truth by allowing the network to learn the mapping between two instances of the noisy image containing the same signal. Despite the efforts to remove the need for clean images in the training procedure, it is unfortunately still impractical to acquire several instances of the noisy image in some contexts, particularly for endomicroscopy. Alternatively, Batson et al. [2] proposed Noise2Self which is a self-supervised approach to denoise a corrupted image from only a single noisy instance.
In this paper, in addition to relying on explicit signal priors, we propose incorporating novel regularization terms into the CNN loss function thus encouraging the noise to respect the whiteness priors. Using these priors, we train our network to map the noisy images to the clean and noise components, without any ground truth. Moreover, instead of computing the noise by subtracting the predicted signal from the noisy input, we employ a neural architecture with two tails – one for inferring the clean image and the other for the noise. This design can serve as a new architectural regularization to facilitate the separability of signal and noise in a latent space, leading to the better reconstruction of both.
In a nutshell, the main contributions of the paper are summarized as follows:
- We propose a novel blind denoising model that leverages both signal and noise priors in the form of regularization terms in a CNN loss function.
- We present the first network that is capable of decoupling signal and noise without any ground truth.
- We demonstrate that our proposed method outperforms state-of-the-art blind denoising techniques on three confocal endomicroscopy datasets.
2 Related Works
Most existing image denoising techniques in the literature leverage prior knowledge and we categorize them into the following three general groups:
Data-driven priors. Instead of explicit consideration, the approaches in this group encode the priors implicitly through learning the direct mapping between pairs of corrupted and clean images. Thanks to the strength of deep networks, recent denoising methods have achieved impressive results. The first attempt to leverage neural networks for image denoising was conducted by Burger et al. [6] who used a simple multi-layer perceptron architecture for the learning. Afterward, Zhang et al. [29] demonstrated that residual learning and batch normalization not only improve the denoising performance but also expedite the training procedure. Thai et al. [27] proposed a very deep memory persistent network by stacking dozens of memory units that are densely connected. Zhang et al. [30] suggested feeding a noise level map as well as the noisy image to the network. Doing so, the network will be able to handle different levels of corruption and spatially-variant noise. Most recently, the network proposed by Guo et al. [12] consists of one noise level estimator and one non-blind denoiser sub-network to robustly generalize the model performance on real-world noise.
Signal Priors. Approaches in this group, which encompasses the majority of traditional image denoising techniques, exploit hand-crafted priors about the underlying signal, such as smoothness [4], gradient [23], sparsity [9] and non-local self-similarity [7]. Classic approaches for image denoising can be traced back to the piecewise constant prior in the Mumford-Shah model, which was originally proposed for image segmentation [21] and then extended by Tsai et al. [28] for image denoising. Later, Rudin et al. [23] exploited the total variation (TV) prior in a variational formulation to preserve sharp edges of the underlying signal while suppressing the noise. The non-local self-similarity (NSS) prior relates to the fact that images often contain many similar yet non-local (i.e., spatially distant) patterns. BM3D [7] is an NSS-based denoising method that groups similar patches, transforms the groups into the frequency domain, and attenuates the noise by hard-thresholding the transform coefficients. More recently, Ulyanov et al. [18] proposed to leverage deep networks with randomly initialized weights as a hand-crafted prior for image denoising. In their method, the network is trained to reconstruct the corrupted image from a random noise input, with an early stopping constraint to prevent fitting to the noise. One can interpret their work as an NSS-based algorithm, since a linear combination of spatially shared kernels is used to generate a clean image.
Noise Priors. There have been several attempts to exploit prior knowledge about the spectral whiteness of the residual images in deblurring problems [1, 13, 24]. However, to the best of our knowledge, only a few have focused on image denoising. Lanza et al. [16] proposed to explicitly enforce the whiteness property by imposing soft constraints on the auto-correlation of the residual image in the frequency domain. Recently, Soltanayev [25] leveraged the Gaussianity property of AWGN to adopt Stein's Unbiased Risk Estimator (SURE) as a loss function for training deep networks without clean ground truth images.
3 Method
Image denoising refers to the process of inspecting a noisy image to decouple the underlying signal from the noise component. One common assumption is that the noise is additive white Gaussian noise (AWGN) with a standard deviation of $\sigma$. Let $y \in \mathbb{R}^N$ represent the noisy image formed by adding random white Gaussian noise $n \in \mathbb{R}^N$ to a clean signal $x \in \mathbb{R}^N$, where $N$ denotes the total number of pixels. Letting $i$ denote the index of a single pixel, the degradation model can be written as follows:
$$y_i = x_i + n_i,$$
$$n_i \sim \mathcal{N}(0, \sigma^2).$$
The goal of our model is to take a single noisy image $y$ and decouple it back into the signal $\hat{x}$ and noise $\hat{n}$, without any ground truth. To do so, we need to design accurate and discriminative priors for each component and use them as the supervisory signal during training.
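As a minimal sketch of the additive degradation model (noisy image = clean signal + independent zero-mean Gaussian noise per pixel), the snippet below synthesizes a noisy image with NumPy. The function name `add_awgn`, the flat test image, and the value `sigma=0.1` are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def add_awgn(clean, sigma, rng):
    """Return (y, n) with y = x + n and n_i ~ N(0, sigma^2), independent per pixel."""
    noise = rng.normal(loc=0.0, scale=sigma, size=clean.shape)
    return clean + noise, noise

rng = np.random.default_rng(0)
x = np.full((256, 256), 0.5)             # hypothetical flat "clean" image
y, n = add_awgn(x, sigma=0.1, rng=rng)   # sigma is an assumed value
noise_std = float(n.std())               # close to sigma for a large image
```

Because the noise is drawn independently for every pixel, its empirical standard deviation approaches `sigma` as the image grows.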
3.1 Architecture and Inference
As depicted in Fig 2, our architecture embodies the encoder-decoder paradigm with skip connections [22] (removed in Fig 2 for simplicity). The encoder $E$ takes the noisy image $y$, with normalized pixel intensities, and outputs two $d$-dimensional latent features, $z_x$ and $z_n$. The features $z_x$ and $z_n$ are then decoded to the spatial domain by a shared decoder $D$, resulting in the signal and noise estimates $\hat{x} = D(z_x)$ and $\hat{n} = D(z_n)$. Accordingly, our network has a single input and two outputs. Once the model is trained, the latent representation for the noise can simply be discarded, as it only serves as an architectural regularizer. In other words, the model maps $y$ to both $z_x$ and $z_n$, while only $z_x$ is fed into the decoder to reconstruct the signal $\hat{x}$.
We use the original U-Net [22] architecture for all our experiments with depth 5, kernel size 3, bilinear upsampling and linear activation in the last layer.
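The data flow of the two-tail design can be illustrated with a deliberately tiny stand-in. In the sketch below, random linear maps replace the U-Net encoder and decoder, and the sizes `N` and `d` are hypothetical; it only shows how one encoder yields two latent codes that pass through a shared decoder, and how the noise code is dropped at inference.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 64, 8   # pixels per flattened image and latent size (both hypothetical)

# Toy linear layers standing in for the real encoder and shared decoder.
W_enc = rng.normal(size=(2 * d, N)) / np.sqrt(N)
W_dec = rng.normal(size=(N, d)) / np.sqrt(d)

def encode(y):
    """Map the noisy image to two latent codes: (signal code, noise code)."""
    z = W_enc @ y
    return z[:d], z[d:]

def decode(z):
    """Shared decoder applied to either latent code."""
    return W_dec @ z

y = rng.random(N)                # stand-in noisy image
z_x, z_n = encode(y)

# Training: both codes are decoded so the losses can see both estimates.
x_hat, n_hat = decode(z_x), decode(z_n)

# Inference: the noise code is discarded and only the signal code is decoded.
x_only = decode(z_x)
```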
3.2 Proposed Loss Functions
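Our model adopts a multi-loss objective function which encodes reconstruction, a.k.a. data fidelity, and priors about the properties of the signal and noise. The proposed loss function can be formulated as follows:
$$\mathcal{L} = \mathcal{L}_{rec} + \mathcal{L}_{n} + \mathcal{L}_{s},$$
where $\mathcal{L}_{rec}$ is the reconstruction loss, and $\mathcal{L}_{n}$ and $\mathcal{L}_{s}$ collect the noise and signal priors, respectively.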
The noise prior consists of two distinct terms. The first term, the auto-correlation loss $\mathcal{L}_{ac}$, leverages the statistical fact that a white noise image contains pixel intensities which are spatially uncorrelated, given the signal. The second term, the stationary loss $\mathcal{L}_{st}$, penalizes the network if the variance of the noise estimate is not spatially consistent. Hence, the loss term for noise priors can be written as:
$$\mathcal{L}_{n} = \lambda_{ac}\,\mathcal{L}_{ac} + \lambda_{st}\,\mathcal{L}_{st},$$
where $\lambda_{ac}$ and $\lambda_{st}$ weight the two terms.
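Turning to signal properties, we take two well-established priors into consideration: piecewise constancy [21] and minimal total variation [23], denoted by $\mathcal{L}_{pc}$ and $\mathcal{L}_{tv}$, respectively. Therefore, the signal prior can be written as follows:
$$\mathcal{L}_{s} = \lambda_{pc}\,\mathcal{L}_{pc} + \lambda_{tv}\,\mathcal{L}_{tv},$$
where $\lambda_{pc}$ and $\lambda_{tv}$ weight the two terms.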
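Reconstruction Loss. Given the signal and noise estimates $\hat{x}$ and $\hat{n}$ at the output, $\mathcal{L}_{rec}$ enforces that the sum of these two components reconstructs the original input. We use a pixel-wise distance $d(\cdot, \cdot)$ to measure the faithfulness over all pixels. Mathematically,
$$\mathcal{L}_{rec} = \frac{1}{N} \sum_{i=1}^{N} d\big(y_i,\ \hat{x}_i + \hat{n}_i\big).$$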
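The reconstruction term can be sketched in a few lines. The choice of distance is an assumption here: the snippet uses the mean squared error purely for illustration, and the function name `reconstruction_loss` is hypothetical.

```python
import numpy as np

def reconstruction_loss(y, x_hat, n_hat):
    """Mean squared difference between the input and the sum of both estimates.

    The squared distance is an assumed choice made for this sketch.
    """
    residual = y - (x_hat + n_hat)
    return float(np.mean(residual ** 2))

rng = np.random.default_rng(0)
y = rng.random((32, 32))                       # stand-in noisy input
exact = reconstruction_loss(y, y, np.zeros_like(y))
biased = reconstruction_loss(y, y, np.full_like(y, 0.1))
```

Note that this term alone is satisfied by any split of the input into two parts that sum back to it, which is why the signal and noise priors are needed to pick a meaningful decomposition.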
Auto-correlation Loss. In statistics, auto-correlation (AC) measures the correlation, or similarity, between a random process and a time-lagged version of itself. For a discrete random process $X_t$, the AC at lag $\tau$ is expressed as:
$$R_X(\tau) = \mathbb{E}\left[X_t\, X_{t+\tau}\right].$$
In the context of 2D images, the lag can be considered in the spatial domain across the horizontal and vertical directions. Let $I(u, v)$ denote an image in 2D space, where $u$ and $v$ are the pixel coordinates. We can define the AC function as a mapping from a pair of spatial lags to a scalar value, formulated as follows:
$$R_I(l, m) = \mathbb{E}\left[I(u, v)\, I(u + l, v + m)\right],$$
where $l$ and $m$ are the spatial lags between two distinct pixel coordinates across the horizontal and vertical axes, respectively. The whiteness property of the noise implies that every pixel is uncorrelated with any other pixel. Alternatively stated, given a white noise image $n$, the AC at lag $l = 0$ and $m = 0$ equals the noise variance while being zero elsewhere, i.e.:
$$R_n(l, m) = \begin{cases} \sigma^2 & \text{if } l = 0 \text{ and } m = 0, \\ 0 & \text{otherwise.} \end{cases}$$
Moreover, we assume that the noise is ergodic and that the image to be denoised is sufficiently large. Following our notation, the sample auto-correlation can be used to estimate the AC of a noise image $\hat{n}$ at lag $(l, m)$, defined as:
$$\hat{R}_{\hat{n}}(l, m) = \frac{1}{N} \sum_{u} \sum_{v} \hat{n}(u, v)\, \hat{n}(u + l, v + m).$$
To implement $\mathcal{L}_{ac}$, we use the sample AC and minimize it for any non-zero lag to penalize noise estimates that are spatially correlated. Precisely, we first pad each side of the noise prediction with a reflected copy of itself. Then, the sample AC is calculated at a lag selected randomly from a fixed range for each update.
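The procedure just described (reflect-pad the noise estimate, draw a random non-zero lag, evaluate the sample auto-correlation) can be sketched with NumPy as follows. Several details are assumptions made for illustration: the helper names, the `max_lag` parameter, the use of `np.pad` with `mode="symmetric"` for the reflected copy, and squaring the sample AC so that the penalty is non-negative.

```python
import numpy as np

def sample_autocorr(noise, lag_row, lag_col):
    """Sample auto-correlation of a 2D noise map at an integer lag."""
    h, w = noise.shape
    # Pad every side with a mirrored copy so shifted windows stay in bounds.
    padded = np.pad(noise, ((h, h), (w, w)), mode="symmetric")
    shifted = padded[h + lag_row: 2 * h + lag_row, w + lag_col: 2 * w + lag_col]
    return float(np.mean(noise * shifted))

def autocorr_loss(noise, max_lag, rng):
    """Squared sample AC at one randomly drawn non-zero lag (assumed penalty form)."""
    while True:
        lag_row, lag_col = (int(v) for v in rng.integers(-max_lag, max_lag + 1, size=2))
        if lag_row != 0 or lag_col != 0:
            break
    return sample_autocorr(noise, lag_row, lag_col) ** 2

rng = np.random.default_rng(0)
white = rng.normal(0.0, 0.1, size=(128, 128))   # white noise, variance 0.01
correlated = white + np.roll(white, 1, axis=1)  # neighbouring columns now share noise

ac_zero = sample_autocorr(white, 0, 0)          # near the variance
ac_far = sample_autocorr(white, 3, 5)           # near zero
ac_corr = sample_autocorr(correlated, 0, 1)     # clearly non-zero
loss = autocorr_loss(white, max_lag=8, rng=rng)
```

For white noise the sample AC is close to the variance at zero lag and close to zero at every other lag, whereas the spatially correlated map yields a clearly non-zero value at lag (0, 1), which is the behaviour the loss penalizes.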
Stationary Loss. In general, a random process is said to be stationary if its statistics do not change over time. A white noise image is a simple example of a stationary process, where the variance and mean are invariant to spatial translation. Particularly for the variance, we can mathematically define the stationary property of a noise image $n$ as below:
$$\mathrm{Var}\left[n(u, v)\right] = \mathrm{Var}\left[n(u + \delta_u, v + \delta_v)\right],$$
where $(\delta_u, \delta_v)$ denotes the translation shifts. To enforce the stationary property, we first partition the predicted noise $\hat{n}$ into $K$ non-overlapping blocks and then compute the standard deviation $s_k$ within each block $k$, resulting in a set $S = \{s_1, \dots, s_K\}$. Stationary noise implies that all elements of $S$ are identical, i.e.:
$$s_1 = s_2 = \dots = s_K.$$
Therefore, we apply a Softmax on the elements of $S$ to get a probability distribution $p$ over all blocks. Ideally, $p$ should assign the same probability to every block. To measure this, we compute $\mathcal{L}_{st}$ as the cross-entropy between $p$ and a discrete uniform distribution over the $K$ blocks. In our implementation, the block size, and hence $K$, is randomly selected for each update.
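A minimal NumPy sketch of the stationary loss follows: block-wise standard deviations, a softmax over them, and a cross-entropy against the uniform distribution. The function name, the `block` parameter, and the direction of the cross-entropy (uniform as the target, so that the minimum $\log K$ is reached exactly when all blocks get equal probability) are assumptions of this sketch.

```python
import numpy as np

def stationary_loss(noise, block):
    """Cross-entropy between the uniform distribution and softmax(block stds)."""
    h, w = noise.shape
    hb, wb = h // block, w // block
    # Crop so the map tiles exactly, then view it as hb x wb blocks.
    tiles = noise[:hb * block, :wb * block].reshape(hb, block, wb, block)
    stds = tiles.std(axis=(1, 3)).ravel()
    # Numerically stable log-softmax over the block statistics.
    shifted = stds - stds.max()
    log_p = shifted - np.log(np.exp(shifted).sum())
    # With a uniform target u_k = 1/K, the cross-entropy is -(1/K) * sum_k log p_k.
    return float(-log_p.mean())

rng = np.random.default_rng(0)
white = rng.normal(0.0, 0.1, size=(128, 128))
scale = np.ones((128, 128))
scale[:, 64:] = 5.0                                # right half is five times noisier

loss_stationary = stationary_loss(white, block=16)
loss_varying = stationary_loss(white * scale, block=16)
lower_bound = float(np.log(64))                    # K = 64 blocks of 16 x 16
```

The spatially varying noise map produces a larger loss than the stationary one, and neither can fall below $\log K$.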
Piecewise Constant Loss. Based on the prior suggested by Mumford and Shah [21], $\mathcal{L}_{pc}$ is designed to encourage our model to output signal estimates that contain constant intensity values for all pixels within very small segments. To compute $\mathcal{L}_{pc}$, we first need to simulate the piecewise constant counterpart of the noisy input. To do so, a graph-based segmentation method [10] is first used to partition the noisy image into $M$ pixel clusters $\{C_1, \dots, C_M\}$, where $C_j$ refers to the set of indices of pixels that belong to the $j$th cluster. We subsequently replace the values of the pixels within each cluster with their average intensity, resulting in $\bar{y}$. Afterward, we measure $\mathcal{L}_{pc}$ by computing the distance $d(\cdot, \cdot)$ between the gradients of the signal estimate $\hat{x}$ and the simulated piecewise constant image $\bar{y}$:
$$\mathcal{L}_{pc} = \frac{1}{N}\sum_{i=1}^{N} \Big[ d\big(\nabla_h \hat{x}_i,\ \nabla_h \bar{y}_i\big) + d\big(\nabla_v \hat{x}_i,\ \nabla_v \bar{y}_i\big) \Big],$$
where $\nabla_h$ and $\nabla_v$ denote the gradient operation across the horizontal and vertical axes. We compare gradients rather than intensities because the goal is not to make the network reproduce the pixel intensities of $\bar{y}$, but to transfer the smoothness of the pixels within each cluster.
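The piecewise constant loss can be sketched as below, assuming the cluster label map is already available from some segmentation step (the graph-based segmentation itself is not reproduced here). The helper names, the forward-difference gradients, and the absolute-difference distance are illustrative assumptions.

```python
import numpy as np

def piecewise_constant_image(noisy, labels):
    """Replace every pixel by the mean intensity of its cluster."""
    flat = labels.ravel()
    sums = np.bincount(flat, weights=noisy.ravel())
    counts = np.bincount(flat)
    means = sums / np.maximum(counts, 1)   # guard against unused label ids
    return means[labels]

def gradients(img):
    """Forward differences along the horizontal and vertical axes."""
    return np.diff(img, axis=1), np.diff(img, axis=0)

def piecewise_constant_loss(x_hat, noisy, labels):
    """Mean absolute gradient difference between x_hat and the cluster-mean image."""
    y_bar = piecewise_constant_image(noisy, labels)
    gh_x, gv_x = gradients(x_hat)
    gh_y, gv_y = gradients(y_bar)
    return float(np.abs(gh_x - gh_y).mean() + np.abs(gv_x - gv_y).mean())

rng = np.random.default_rng(0)
noisy = rng.random((8, 8))
labels = np.zeros((8, 8), dtype=int)       # hypothetical two-cluster segmentation
labels[:, 4:] = 1

y_bar = piecewise_constant_image(noisy, labels)
loss_offset = piecewise_constant_loss(y_bar + 3.0, noisy, labels)
loss_noisy = piecewise_constant_loss(noisy, noisy, labels)
```

Adding a constant offset to the cluster-mean image leaves the loss at essentially zero, which illustrates why gradients are compared: the term rewards within-cluster smoothness without tying the output to the cluster-mean intensities.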
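Total Variation Loss. $\mathcal{L}_{tv}$ regularizes the model to preserve large-scale edges and textures of the image while smoothing out the noise gradients. We use the standard formulation of total variation [23] in our implementation. Given the signal estimate $\hat{x}$, $\mathcal{L}_{tv}$ can be written as:
$$\mathcal{L}_{tv} = \frac{1}{N}\sum_{i=1}^{N} \Big( \left|\nabla_h \hat{x}_i\right| + \left|\nabla_v \hat{x}_i\right| \Big).$$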
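A short sketch of a total variation penalty is given below. It assumes the anisotropic form (sum of absolute horizontal and vertical forward differences) normalized by the number of pixels; the isotropic variant, which takes the square root of the summed squared differences, would be an equally plausible reading of the standard formulation.

```python
import numpy as np

def total_variation(x_hat):
    """Anisotropic total variation, averaged over the number of pixels."""
    dh = np.abs(np.diff(x_hat, axis=1))   # horizontal forward differences
    dv = np.abs(np.diff(x_hat, axis=0))   # vertical forward differences
    return float((dh.sum() + dv.sum()) / x_hat.size)

step = np.zeros((4, 4))
step[:, 2:] = 1.0                          # one sharp vertical edge
rng = np.random.default_rng(0)
noisy_step = step + rng.normal(0.0, 0.2, size=step.shape)

tv_flat = total_variation(np.ones((4, 4)))
tv_step = total_variation(step)
tv_noisy = total_variation(noisy_step)
```

A single clean edge contributes only a small amount of variation, while pixel-level noise adds variation everywhere, so minimizing this term suppresses noise more strongly than it suppresses isolated edges.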
4 Experiments
Implementation Details. Noisy images are obtained by synthetically contaminating the clean images with additive white Gaussian noise. During training, we randomly crop images into fixed-size patches and augment them with random 90° rotations and horizontal/vertical flips. The quantitative evaluation, however, is reported on the full-size images of the test set. All networks are trained for 500 epochs with a batch size of 16 using the Adam optimizer with default parameters. The learning rate is halved every 100 epochs. For Noise2Noise and Noise2Self, we replicate the original training settings reported by the authors. We use publicly released code to implement the non-local means (NLM) and BM3D denoising methods. All loss terms in the loss function have coefficients equal to 1, except for the total variation term, which uses a different weight.
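The training-time data handling and step schedule described above can be sketched as follows. The function names, the `base_lr` argument, and the crop `size` are hypothetical placeholders, since the concrete values are not reproduced here; only the halving interval of 100 epochs and the 500-epoch budget come from the text.

```python
import numpy as np

def learning_rate(epoch, base_lr):
    """Step schedule: halve the rate after every 100 completed epochs."""
    return base_lr * 0.5 ** (epoch // 100)

def random_crop(img, size, rng):
    """Cut a random size x size patch out of a 2D image."""
    top = int(rng.integers(0, img.shape[0] - size + 1))
    left = int(rng.integers(0, img.shape[1] - size + 1))
    return img[top:top + size, left:left + size]

def augment(patch, rng):
    """Random multiple-of-90-degree rotation plus random horizontal/vertical flips."""
    patch = np.rot90(patch, k=int(rng.integers(0, 4)))
    if rng.random() < 0.5:
        patch = np.fliplr(patch)
    if rng.random() < 0.5:
        patch = np.flipud(patch)
    return patch

rng = np.random.default_rng(0)
image = rng.random((64, 64))                            # stand-in training image
patch = augment(random_crop(image, size=32, rng=rng), rng)
rates = [learning_rate(e, base_lr=1.0) for e in (0, 99, 100, 499)]
```

With a unit base rate, the schedule stays at 1.0 through epoch 99, drops to 0.5 at epoch 100, and reaches 0.0625 during the final 100 epochs.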
References
- [1] M. S. C. Almeida and M. A. T. Figueiredo. Parameter estimation for blind and non-blind deblurring using residual whiteness measures. IEEE Transactions on Image Processing, 22(7):2751–2763, 2013.
- [2] J. Batson and L. Royer. Noise2Self: Blind denoising by self-supervision. In International Conference on Machine Learning, volume 97, pages 524–533, 2019.
- [3] F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, and A. Jemal. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians, 68(6):394–424, 2018.
- [4] A. Buades, B. Coll, and J. M. Morel. A review of image denoising algorithms, with a new one. Multiscale Modeling & Simulation, 4(2):490–530, 2005.
- [5] A. Buades, B. Coll, and J. M. Morel. Nonlocal image and movie denoising. International Journal of Computer Vision, 76(2):123–139, 2008.
- [6] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In IEEE Conference on Computer Vision and Pattern Recognition, pages 2392–2399, 2012.
- [7] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
- [8] W. Dong, L. Zhang, G. Shi, and X. Li. Nonlocally centralized sparse representation for image restoration. IEEE Transactions on Image Processing, 22(4):1620–1630, 2013.
