Image Super Resolution via Bilinear Pooling: Application to Confocal Endomicroscopy
Saeed Izadi, Darren Sutton, Ghassan Hamarneh

TL;DR
This paper introduces a novel attention mechanism combining first- and second-order statistics for super resolution of confocal endomicroscopy images, improving quality while maintaining efficiency for real-time clinical use.
Contribution
It proposes a new attention mechanism that integrates multiple statistical pooling methods, demonstrating competitive performance with fewer parameters.
Findings
Outperforms 11 existing super resolution methods on three datasets.
Achieves high PSNR, SSIM, and IFC metrics.
Is lightweight and suitable for real-time applications.
| Method | CLE100 PSNR | CLE100 SSIM | CLE200 PSNR | CLE200 SSIM | CLE1000 PSNR | CLE1000 SSIM | Time |
|---|---|---|---|---|---|---|---|
| Scale 2 | | | | | | | |
| Bicubic | 33.69±0.06 | 0.8693 | 35.53±0.01 | 0.9029 | 34.45±0.01 | 0.8920 | 0.02 |
| A+✤ [19] | 34.22±0.07 | 0.8928 | 36.14±0.01 | 0.9218 | 35.04±0.01 | 0.9114 | 6.72 |
| ANR✤ [18] | 36.44±0.13 | 0.9226 | 39.10±0.01 | 0.9559 | 37.64±0.01 | 0.9559 | 6.07 |
| GR✤ [18] | 36.56±0.13 | 0.9243 | 39.26±0.01 | 0.9579 | 37.79±0.01 | 0.9448 | 4.47 |
| SRCNN✝ [4] | 35.75±0.11 | 0.9181 | 38.25±0.01 | 0.9494 | 36.87±0.01 | 0.9380 | 0.06 |
| VDSR✝ [11] | 36.72±0.13 | 0.9276 | 39.31±0.01 | 0.9578 | 37.89±0.01 | 0.9462 | 0.25 |
| DRCN✝ [12] | 36.65±0.13 | 0.9257 | 39.29±0.01 | 0.9575 | 37.83±0.01 | 0.9452 | 0.48 |
| LapSRN✝ [13] | 36.71±0.13 | 0.9264 | 39.25±0.01 | 0.9583 | 37.91±0.01 | 0.9462 | 0.07 |
| SESR✝ [3] | 36.76±0.13 | 0.9282 | 39.36±0.01 | 0.9583 | 37.91±0.01 | 0.9462 | 0.27 |
| RBAM (Ours)✝ | 36.91±0.12 | 0.9321 | 39.45±0.01 | 0.9590 | 38.22±0.01 | 0.9501 | 0.18 |
| Scale 4 | | | | | | | |
| Bicubic | 31.29±0.04 | 0.6673 | 32.45±0.01 | 0.7318 | 31.78±0.01 | 0.7278 | 0.02 |
| A+ [19] | 31.57±0.04 | 0.7042 | 32.76±0.01 | 0.7607 | 32.06±0.01 | 0.7517 | 3.03 |
| ANR [18] | 31.68±0.04 | 0.7160 | 32.93±0.01 | 0.7736 | 32.23±0.01 | 0.7671 | 2.88 |
| GR [18] | 31.70±0.04 | 0.7201 | 32.95±0.01 | 0.7736 | 32.25±0.01 | 0.7703 | 2.31 |
| SRCNN [4] | 31.59±0.04 | 0.7073 | 32.76±0.01 | 0.7617 | 32.07±0.01 | 0.7566 | 0.06 |
| VDSR [11] | 31.66±0.04 | 0.7144 | 32.86±0.01 | 0.7804 | 32.16±0.01 | 0.7635 | 0.25 |
| DRCN [12] | 31.70±0.04 | 0.7214 | 32.92±0.01 | 0.7750 | 32.21±0.01 | 0.7635 | 0.48 |
| LapSRN [13] | 31.68±0.04 | 0.7190 | 32.76±0.01 | 0.7617 | 32.29±0.01 | 0.7737 | 0.08 |
| SESR [3] | 31.76±0.04 | 0.7249 | 32.99±0.01 | 0.7804 | 32.29±0.01 | 0.7737 | 0.33 |
| RBAM (Ours) | 31.84±0.04 | 0.7315 | 33.11±0.01 | 0.7852 | 32.47±0.01 | 0.7874 | 0.07 |
School of Computing Science, Simon Fraser University, Canada
{saeedi, darrens, hamarneh}@sfu.ca
Abstract
Recent developments in image acquisition literature have miniaturized the confocal laser endomicroscopes to improve usability and flexibility of the apparatus in actual clinical settings. However, miniaturized devices collect less light and have fewer optical components, resulting in pixelation artifacts and low resolution images. Owing to the strength of deep networks, many supervised methods known as super resolution have achieved considerable success in restoring low resolution images by generating the missing high frequency details. In this work, we propose a novel attention mechanism that, for the first time, combines 1st- and 2nd-order statistics for the pooling operation, in the spatial and channel-wise dimensions. We compare the efficacy of our method to 10 other existing single image super resolution techniques that compensate for the reduction in image quality caused by the necessity of endomicroscope miniaturization. All evaluations are carried out on three publicly available datasets. Experimental results show that our method produces results superior to the state of the art in terms of PSNR and SSIM. Additionally, our proposed method is lightweight and suitable for real-time inference.
1 Introduction
Colorectal cancer is known as the fourth most-common cancer and remains one of the leading causes of cancer related mortality in the world. In 2018, more than 1 million people were affected by colorectal cancer worldwide, resulting in an estimated 550,000 deaths [2]. Rapid histopathologic assessment is an important tool that may improve disease prognosis by detecting early-stage cancer and pre-cancerous conditions. Although biopsy and ex-vivo tissue examination are widely accepted as the diagnostic gold standard, such procedures take time and may limit the ability of the endoscopist to rapidly gauge disease severity. Confocal laser endomicroscopy (CLE), on the other hand, has substantially improved real-time in-vivo visualization of the subsurface of living cells, vascular structures, and tissue patterns during endoscopic examination [10].
For in-vivo histological examination, the large size of the microscope complicates navigation of the interior of the body in a clinical setting. Therefore, it is necessary to reduce the size of the microscope to completely and safely access the organ(s) of interest. However, miniaturization reduces the number of optical elements in the microscope probe, introducing pixelation artifacts in the acquired images. One strategy to remove image artifacts and enhance image quality is to directly post-process degraded images. An emerging process in the field of image processing, referred to as single image super-resolution (SR), aims to reconstruct an accurate high-resolution (HR) image given its low-resolution (LR) counterpart. Thus, SR is a promising software method to mitigate image degradation due to hardware miniaturization.
Among traditional SR algorithms, Huang et al. [8] proposed leveraging self-similarity modulo affine transformations to accommodate natural deformation of recurring statistical priors within and across scales of an image. Timofte et al. [18, 19] used a combination of neighbour embedding and sparse dictionary learning over an external database and proposed anchored neighborhood regression in the dictionary atom space. Recently, CNNs have advanced the SR research field by directly learning the mapping between LR and HR images [4, 11, 12, 13, 1]. Dong et al. [4] demonstrated that a fully convolutional network trained end-to-end can perform LR-to-HR nonlinear mapping. Kim et al. [11] suggested a trained network to predict additive details in the form of a residual image, which is summed with the interpolated image. Kim et al. [12] addressed model overfitting by reducing the number of parameters via recursive convolutional layers. Lai et al. [13] designed a network which progressively reconstructs the sub-band residuals of high-resolution images at multiple pyramid levels. Ahn et al. [1] improved speed and efficiency of SR models by designing a cascade mechanism over residual networks. Lastly, Cheng et al. [3] exploited recursive squeeze and excitation modules in a network to exploit relationships between channels. Izadi et al. [9] reported the first attempt to deploy CNNs on CLE images. They used a densely connected CNN to transform synthetic LR images into HR ones. Ravi et al. [15] employed a CNN to restore missing details into LR images. They collected a set of consecutive LR frames and generated synthetic HR images using a video registration technique. In a more recent study [16], Ravi et al. trained a CNN for unsupervised SR on CLE images using a cycle consistency regularization, designed to impose acquisition properties on the super-resolved images.
In this paper, we present a lightweight convolutional neural network (CNN) that is appropriate for frame-wise SR by incorporating a novel attention mechanism. In contrast to SESR [3], which leverages attention modules from the Squeeze-and-Excitation network (SENet) [7] to re-weight channels, we introduce a novel weighting scheme to recalibrate learned features based on pairwise relationships. Our attention modules comprise both 1st-order pooling and 2nd-order pooling (a.k.a. bilinear pooling), improving the quality of learned features in the network by considering pairwise correlations along feature channels and spatial regions [5]. The compactness and computational speed of our network lend themselves well to real-time implementation during in-vivo examination. We demonstrate that stacking attention modules between a low-level feature extraction head and a feature integration tail produces results that are quantitatively and qualitatively superior to existing SR methods and generalizes well to unseen microscopic datasets.
2 Method
Network Overview. Fig. 1-a depicts the overall architecture of our proposed LR-to-SR network, which maps a low-resolution input to a super-resolved output for a given downsampling factor. We use a convolution layer to extract initial features from the input, i.e.
[TABLE]
where the weights of the convolution layer are learnable parameters. In our proposed network, the initial features are updated by a sequence of residual attention modules and a skip connection. Together these form the high-level feature extraction stage:
[TABLE]
To upsample the feature maps, we use sub-pixel convolutions, followed by a single-channel convolution for SR reconstruction:
[TABLE]
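To make the sub-pixel upsampling step concrete, the following is a minimal pure-Python sketch of the channel-to-space rearrangement that a sub-pixel convolution ends with. It assumes the common ordering in which input channel c·r² + i·r + j fills offset (i, j) of every r×r output block; the text above does not state the ordering used, and the function name `pixel_shuffle` and the nested-list layout `[channel][row][col]` are illustrative choices rather than the authors' code.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Assumed ordering: input channel c*r*r + i*r + j goes to offset (i, j)
    inside each r x r block of output channel c.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x2 feature maps become a single 2x4 map at scale r = 2.
lr_feats = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
sr_map = pixel_shuffle(lr_feats, 2)
```

In a full network this rearrangement would follow a learned convolution that produces r² channels per output channel; only the reshaping is shown here.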
Residual Bilinear Attention Module. In our proposed RBAM, we combine 1st- and 2nd-order pooling operations spatially and channel-wise to recalibrate learned features for efficient network training. Fig. 1-b illustrates the structure of our proposed RBAM. Mathematically, we formulate RBAM as:
[TABLE]
where the mapping applied before the skip connection comprises the attention modules. Given the input feature maps, two convolutions interleaved with a ReLU activation function are performed to produce high-level feature maps as input to the attention branches:
[TABLE]
Channel-wise Attention (CA) Branch. CA leverages the inter-channel correspondence between feature responses (Fig. 1-c). 1st- and 2nd-order pooling mechanisms operate on the high-level feature maps, producing two vectors. The 1st-order CA vector is obtained by spatial average pooling to squeeze the feature map of each channel [7]. To obtain the 2nd-order CA vector, pairwise channel correlations are first computed in the form of a covariance matrix by spatial flattening, dimension permutation, and matrix multiplication. Each row of the covariance matrix encodes the statistical dependency of a channel with respect to every other channel [5]. Given the covariance matrix, we adopt a row-wise convolution to produce the 2nd-order CA vector. Finally, two successive 1-D convolutions interleaved with a ReLU activation function operate on the sum of the two vectors. The output of the convolution operation is fed into a sigmoid function, followed by element-wise multiplication to produce the updated feature maps of the b-th module:
[TABLE]
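The channel-wise branch can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the learned row-wise convolution is replaced by a fixed weighted sum over each covariance row (the hypothetical `row_weights` argument), the two learned 1-D convolutions are omitted, and the covariance is normalised by the number of pixels.

```python
import math


def channel_attention(feats, row_weights=None):
    """Gate each channel of feats ([C][H][W] nested lists).

    Returns (gated_feats, gates). Learned layers are replaced by fixed
    stand-ins, so this only illustrates the pooling logic.
    """
    num_ch = len(feats)
    # Spatial flattening: one list of H*W values per channel.
    flat = [[v for row in ch for v in row] for ch in feats]
    n = len(flat[0])
    # 1st-order statistic: spatial average of each channel.
    mean = [sum(ch) / n for ch in flat]
    # 2nd-order statistic: C x C covariance between channels.
    centered = [[v - m for v in ch] for ch, m in zip(flat, mean)]
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j])) / n
            for j in range(num_ch)] for i in range(num_ch)]
    # Assumption: uniform weights stand in for the learned row-wise convolution.
    if row_weights is None:
        row_weights = [1.0 / num_ch] * num_ch
    second = [sum(w * v for w, v in zip(row_weights, row)) for row in cov]
    # Assumption: the two learned 1-D convolutions are skipped (identity).
    gates = [1.0 / (1.0 + math.exp(-(m + s))) for m, s in zip(mean, second)]
    gated = [[[g * v for v in row] for row in ch] for ch, g in zip(feats, gates)]
    return gated, gates


# A constant channel and a textured channel with the same mean.
example = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [0.0, 2.0]]]
gated_example, gate_values = channel_attention(example)
```

Both example channels have the same spatial mean, so 1st-order pooling alone would gate them identically; the covariance term raises the gate of the textured channel.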
Spatial Attention (SA) Branch. SA indicates shared correspondence between spatial regions across all feature maps (Fig. 1-d). Given the same high-level feature maps as input, the 1st-order spatial attention matrix is computed by average pooling along the channel dimension to aggregate information for each spatial location across all features. To compute the 2nd-order spatial attention matrix, we first reduce the spatial size of the feature maps by applying average pooling. Then, appropriate reshaping, dimension permutation, and matrix multiplication are applied to obtain the covariance matrix. Similar to channel-wise attention, a row-wise convolution is applied to the covariance matrix. Finally, dimension permutation and nearest-neighbor interpolation produce the 2nd-order spatial attention matrix at the original spatial size. We add the two matrices together element-wise and apply a convolution that feeds a sigmoid function. Spatial attention is realised by element-wise multiplication over all feature maps, formulated as:
[TABLE]
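The spatial branch follows the same pattern with the roles of channels and locations swapped. The sketch below is a simplified illustration under assumptions that are not taken from the paper: the pooled grid size `pooled` is a hypothetical parameter, the covariance between regions is centred and averaged across channels, the learned row-wise convolution is replaced by a plain row average, and the convolution before the sigmoid is omitted.

```python
import math


def spatial_attention(feats, pooled=2):
    """Gate every pixel of feats ([C][H][W]); H and W must divide by `pooled`."""
    num_ch, height, width = len(feats), len(feats[0]), len(feats[0][0])
    if height % pooled or width % pooled:
        raise ValueError("H and W must be divisible by the pooled size")
    bh, bw = height // pooled, width // pooled  # average-pooling block size
    # 1st-order statistic: average over channels at every pixel.
    first = [[sum(feats[c][y][x] for c in range(num_ch)) / num_ch
              for x in range(width)] for y in range(height)]
    # Average-pool to pooled x pooled regions; keep one C-vector per region.
    regions = []
    for py in range(pooled):
        for px in range(pooled):
            regions.append([
                sum(feats[c][y][x]
                    for y in range(py * bh, (py + 1) * bh)
                    for x in range(px * bw, (px + 1) * bw)) / (bh * bw)
                for c in range(num_ch)])
    # 2nd-order statistic: covariance between regions (assumed centring
    # and averaging across channels).
    centered = [[v - sum(r) / num_ch for v in r] for r in regions]
    count = len(regions)
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j])) / num_ch
            for j in range(count)] for i in range(count)]
    # Assumption: a row average stands in for the learned row-wise convolution.
    second_small = [sum(row) / count for row in cov]
    # Nearest-neighbour upsampling back to H x W.
    second = [[second_small[(y // bh) * pooled + (x // bw)]
               for x in range(width)] for y in range(height)]
    # Assumption: the learned convolution before the sigmoid is skipped.
    gate = [[1.0 / (1.0 + math.exp(-(first[y][x] + second[y][x])))
             for x in range(width)] for y in range(height)]
    gated = [[[feats[c][y][x] * gate[y][x] for x in range(width)]
              for y in range(height)] for c in range(num_ch)]
    return gated, gate


# Two constant 4x4 channels with values 1 and 3.
example = [[[1.0] * 4 for _ in range(4)], [[3.0] * 4 for _ in range(4)]]
gated_example, gate_map = spatial_attention(example, pooled=2)
```

Pooling before the covariance keeps the matrix small: its side length is the number of pooled regions rather than the number of pixels.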
Attention Fusion. The updated features from the two branches are concatenated and aggregated via a convolution. Lastly, the module input is added via a skip connection:
[TABLE]
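As a rough illustration of the fusion step, the sketch below concatenates the two branch outputs along the channel axis, mixes them back to the original channel count, and adds the skip input. It assumes a pointwise (1×1) kernel for the aggregating convolution, which is an assumption made here for simplicity; the function name `fuse_attention` and the `weights` layout are hypothetical.

```python
def fuse_attention(ca_feats, sa_feats, skip, weights):
    """Fuse branch outputs and add the skip input.

    ca_feats, sa_feats, skip: [C][H][W] nested lists.
    weights: [C][2C] pointwise kernel (assumed 1x1 convolution, no bias).
    """
    stacked = ca_feats + sa_feats  # channel-wise concatenation: [2C][H][W]
    height, width = len(skip[0]), len(skip[0][0])
    return [[[sum(w * ch[y][x] for w, ch in zip(w_row, stacked)) + skip[c][y][x]
              for x in range(width)] for y in range(height)]
            for c, w_row in enumerate(weights)]


# One channel, 1x2 pixels, averaging the two branches before the skip add.
fused = fuse_attention([[[1.0, 2.0]]], [[[3.0, 4.0]]], [[[10.0, 20.0]]], [[0.5, 0.5]])
```

With a 1×1 kernel the fusion is a per-pixel linear map over the 2C concatenated channels, which keeps the parameter count of each module small.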
3 Results and Discussion
Data. We evaluate existing state-of-the-art SR methods, as well as our proposed RBAM, on three publicly available CLE datasets (Table 1). We select images rich in texture by assessing the SR performance of bicubic interpolation on the unseen test set. As depicted in Fig. 3, images with PSNR scores below the mean PSNR score of the bicubic method evaluated on the test set are deemed "texture rich" and are used for evaluation, whereas images with scores above the mean are deemed "texture poor". In other words, images that can be effectively restored using bicubic interpolation are rejected for evaluation, as they contain little information on which to assess the performance of state-of-the-art methods. Evaluation assesses each method's ability to reconstruct an HR image from a synthesized LR counterpart obtained via bicubic downsampling with the appropriate factor (2 or 4).
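The selection rule described above can be expressed as a short sketch. The PSNR helper assumes 8-bit images (peak value 255), which is an assumption, and the scores in the example are made up for illustration; the function names are hypothetical.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """PSNR in dB between two equally sized 2-D images (nested lists)."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(reference, estimate)
             for a, b in zip(row_a, row_b)]
    mse = sum(diffs) / len(diffs)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def select_texture_rich(bicubic_psnrs):
    """Return indices of images whose bicubic PSNR is below the set mean."""
    mean_psnr = sum(bicubic_psnrs) / len(bicubic_psnrs)
    return [i for i, score in enumerate(bicubic_psnrs) if score < mean_psnr]


# Made-up bicubic PSNR scores for four test images (mean is 35.0 dB).
scores = [30.0, 36.0, 33.0, 41.0]
kept = select_texture_rich(scores)
```

Here images 0 and 2 fall below the mean and would be kept as texture rich, while the other two would be rejected.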
Training Settings. We train all methods on a random partition (80%) of CLE100, and evaluate them on the remaining 20% as well as on CLE200 and CLE1000. For DL-based methods, we replicated the reported training settings, and used public code for traditional algorithms. For our model, the number of RBAMs and the number of features are chosen to create a lightweight network. In each training batch, 16 LR patches are randomly extracted as inputs and augmented by random 90° rotations and horizontal/vertical flips. We use the Adam optimizer and L1 loss to train our network for 300 epochs. The learning rate is halved every 50 epochs.
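The step schedule above (halving every 50 epochs over 300 epochs) can be written as a one-line function. The initial value used in the example is a placeholder chosen so that the output reads as a multiplier; it is not the paper's setting, and the function name is hypothetical.

```python
def learning_rate(epoch, initial_lr, step=50, factor=0.5):
    """Step decay: multiply the rate by `factor` once every `step` epochs."""
    return initial_lr * factor ** (epoch // step)


# Placeholder initial rate of 1.0, so each value is the decay multiplier.
multipliers = [learning_rate(e, 1.0) for e in range(0, 300, 50)]
```

Over 300 epochs the rate is halved five times, so the final 50 epochs run at 1/32 of the initial value.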
Ablation Investigation. We assess the effectiveness of the individual components in our network modules by ablating attention blocks and evaluating performance after 50 epochs. Our investigation shows that, for SR on CLE100, attention-based variants outperform the baseline, demonstrating the merits of incorporating spatial and channel-wise contextual information. We also observed that using both 1st- and 2nd-order pooling operations simultaneously outperforms using either 1st- or 2nd-order channel-wise pooling individually. We similarly note that using both spatial and channel-wise attention outperforms either one alone.
Comparison to State-of-the-art. We compare the performance of traditional algorithms including ANR [18], GR [18] and A+ [19], as well as DL-based techniques including SRCNN [4], VDSR [11], DRCN [12], LapSRN [13], SESR [3] and our proposed RBAM. Table 2 summarizes the quantitative comparisons in terms of peak signal-to-noise ratio (PSNR ± SEM), structural similarity (SSIM), and inference time at 2× and 4× SR. From the table, one can see that most DL-based methods consistently outperform traditional SR algorithms in PSNR and SSIM metrics. In particular, RBAM improves the mean PSNR over all datasets by 0.18 dB and 0.13 dB for 2× and 4× SR, respectively, relative to SESR, the next-best method. Furthermore, RBAM offers a practical compromise between inference time and generalization. Our results show a moderate quantitative increase in PSNR score and a considerable increase in qualitative performance; this is consistent with previous work in single image super resolution [20]. Fig. 2 shows selected image patches from each dataset for qualitative assessment. RBAM can delicately restore high-frequency cues, such as granular textures and sudden changes in grayscale pixel intensity. This manifests qualitatively as improved restoration of high-frequency details such as cell membranes (CLE200, CLE1000 examples) and intracellular spaces (CLE100 example).
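For readers unfamiliar with the PSNR ± SEM notation, the sketch below computes a mean and its standard error from per-image scores. It assumes the sample standard deviation (n − 1 denominator), which may differ from the convention used for the reported numbers, and the scores are made up for illustration.

```python
import math


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of scores."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample variance with an n - 1 denominator.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Made-up per-image PSNR scores in dB.
avg, sem = mean_and_sem([36.0, 37.0, 38.0])
```

Because the SEM shrinks with the square root of the number of images, larger test sets yield tighter intervals for the same per-image spread.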
Motivation for Bilinear Pooling. We combine 1st-order and 2nd-order pooling to recalibrate learned features based on channels that activate often or that correspond to feature-rich inputs, respectively. Channels that activate often are likely responding to common, low-frequency image features. Conversely, channels that are highly correlated may be responding to feature-rich instances in the image space that activate multiple filters simultaneously. High-frequency features tend to be complex and are semantically less common than low-frequency image details. Therefore, channels that learn complex image features may not be emphasized by first-order pooling operations alone. Combining first- and second-order pooling in an attention module ensures that hard-working channels are rewarded without diminishing the optimization of channels that learn complex features in the low-to-high-resolution image mapping space.
4 Conclusion
We proposed the first network that simultaneously leverages both first and second order statistics for pooling in spatial and channel-wise attention mechanisms, resulting in a lightweight and fast model that restores high frequency image details. We compared our proposed model with various traditional and DL-based SR techniques on three CLE datasets in terms of image quality assessment metrics and inference time. Our RBAM network outperforms existing lightweight methods across different datasets, downsampling factors, and SR performance evaluation criteria. Experimental results also highlight the potential applicability of inexpensive software-based post-processing SR modules that improve degraded images in miniaturized CLE devices in real-time.
Acknowledgments. Thanks to the NVIDIA Corporation for the donation of Titan X GPUs used in this research and to the Collaborative Health Research Projects (CHRP) for funding.
References
- [1] N. Ahn et al. Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, 2018.
- [2] F. Bray et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians, 68(6):394–424, 2018.
- [3] X. Cheng et al. SESR: Single image super resolution with recursive squeeze and excitation networks. In IEEE ICPR, pages 147–152, 2018.
- [4] C. Dong et al. Image super-resolution using deep convolutional networks. IEEE TPAMI, 38(2):295–307, 2016.
- [5] Gao et al. Global second-order pooling neural networks. arXiv:1811.12006, 2018.
- [6] E. Grisan et al. 239 Computer aided diagnosis of Barrett's esophagus using confocal laser endomicroscopy: Preliminary data. Gast. Endosc., 75(4):AB126, 2012.
- [7] J. Hu et al. Squeeze-and-excitation networks. In IEEE CVPR, 2018.
- [8] J. Huang et al. Single image super-resolution from transformed self-exemplars. In IEEE CVPR, pages 5197–5206, 2015.
