Fast Underwater Image Enhancement for Improved Visual Perception

Md Jahidul Islam; Youya Xia; Junaed Sattar

arXiv:1903.09766·cs.CV·February 11, 2020

TL;DR

This paper introduces a real-time underwater image enhancement model using a conditional GAN, supported by a large-scale dataset, which improves visual quality and downstream task performance in underwater robotics.

Contribution

The paper presents a novel GAN-based model for underwater image enhancement and introduces EUVP, a large-scale dataset for training and evaluation.

Findings

01. Enhanced images improve object detection accuracy

02. Model performs well on both paired and unpaired training

03. Real-time processing suitable for underwater robots

Abstract

In this paper, we present a conditional generative adversarial network-based model for real-time underwater image enhancement. To supervise the adversarial training, we formulate an objective function that evaluates the perceptual image quality based on its global content, color, local texture, and style information. We also present EUVP, a large-scale dataset of a paired and unpaired collection of underwater images (of `poor' and `good' quality) that are captured using seven different cameras over various visibility conditions during oceanic explorations and human-robot collaborative experiments. In addition, we perform several qualitative and quantitative evaluations which suggest that the proposed model can learn to enhance underwater image quality from both paired and unpaired training. More importantly, the enhanced images provide improved performances of standard models for…

Tables

Table 1: Quantitative comparison of average PSNR and SSIM values on $1$K paired test images of the EUVP dataset.

Model	$PSNR(G(\mathbf{x}), \mathbf{y})$	$SSIM(G(\mathbf{x}), \mathbf{y})$
Input	$17.27 \pm 2.88$	$0.62 \pm 0.075$
Uw-HL	$18.85 \pm 1.76$	$0.7722 \pm 0.066$
Mband-EN	$12.11 \pm 2.55$	$0.4565 \pm 0.097$
Res-WGAN	$16.46 \pm 1.80$	$0.5762 \pm 0.014$
Res-GAN	$14.75 \pm 2.22$	$0.4685 \pm 0.122$
LS-GAN	$17.83 \pm 2.88$	$0.6725 \pm 0.062$
Pix2Pix	$20.27 \pm 2.66$	$0.7081 \pm 0.069$
UGAN-P	$19.59 \pm 2.54$	$0.6685 \pm 0.075$
CycleGAN	$17.14 \pm 2.65$	$0.6400 \pm 0.080$
FUnIE-GAN-UP	$21.36 \pm 2.17$	$0.8164 \pm 0.046$
FUnIE-GAN	$21.92 \pm 1.07$	$0.8876 \pm 0.068$

Table 2: Quantitative comparison of average UIQM values on $1$K paired and $2$K unpaired test images of the EUVP dataset.

Model	Paired data	Unpaired data
Input	$2.20 \pm 0.69$	$2.29 \pm 0.62$
G. Truth	$2.91 \pm 0.65$	N/A
Uw-HL	$2.62 \pm 0.35$	$2.75 \pm 0.32$
Mband-EN	$2.28 \pm 0.87$	$2.34 \pm 0.45$
Res-WGAN	$2.55 \pm 0.64$	$2.46 \pm 0.67$
Res-GAN	$2.62 \pm 0.89$	$2.28 \pm 0.34$
LS-GAN	$2.37 \pm 0.78$	$2.59 \pm 0.52$
Pix2Pix	$2.65 \pm 0.55$	$2.76 \pm 0.39$
UGAN-P	$2.72 \pm 0.75$	$2.77 \pm 0.34$
CycleGAN	$2.44 \pm 0.71$	$2.62 \pm 0.67$
FUnIE-GAN-UP	$2.56 \pm 0.63$	$2.81 \pm 0.65$
FUnIE-GAN	$2.78 \pm 0.43$	$2.98 \pm 0.51$

Table 3: Rank-$n$ accuracy ($n=1,2,3$) for the top four models based on $312$ responses provided by $78$ individuals.

Model	Rank-1 (%)	Rank-2 (%)	Rank-3 (%)
FUnIE-GAN	$24.50$	$68.50$	$88.60$
FUnIE-GAN-UP	$18.67$	$48.25$	$76.18$
UGAN-P	$21.25$	$65.75$	$80.50$
Pix2Pix	$11.88$	$45.15$	$72.45$

Equations

$$\mathcal{L}_{cGAN}(G,D)=\mathbb{E}_{X,Y}\big[\log D(Y)\big]+\mathbb{E}_{X,Y}\big[\log(1-D(X,G(X,Z)))\big] \tag{1}$$

$$\mathcal{L}_{1}(G)=\mathbb{E}_{X,Y,Z}\big[\lVert Y-G(X,Z)\rVert_{1}\big]$$

$$\mathcal{L}_{con}(G)=\mathbb{E}_{X,Y,Z}\big[\lVert \Phi(Y)-\Phi(G(X,Z))\rVert_{2}\big]$$

$$G^{*}=\arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G,D)+\lambda_{1}\mathcal{L}_{1}(G)+\lambda_{c}\mathcal{L}_{con}(G) \tag{2}$$

$$\mathcal{L}_{cyc}(G_{F},G_{R})=\mathbb{E}_{X,Y,Z}\big[\lVert X-G_{R}(G_{F}(X,Z))\rVert_{1}\big]+\mathbb{E}_{X,Y,Z}\big[\lVert Y-G_{F}(G_{R}(Y,Z))\rVert_{1}\big] \tag{3}$$

$$G_{F}^{*},G_{R}^{*}=\arg\min_{G_{F},G_{R}}\max_{D_{Y},D_{X}}\ \mathcal{L}_{cGAN}(G_{F},D_{Y})+\mathcal{L}_{cGAN}(G_{R},D_{X})+\lambda_{cyc}\mathcal{L}_{cyc}(G_{F},G_{R}) \tag{4}$$

$$PSNR(\mathbf{x},\mathbf{y})=10\log_{10}\big[255^{2}/MSE(\mathbf{x},\mathbf{y})\big] \tag{5}$$

$$SSIM(\mathbf{x},\mathbf{y})=\Big(\frac{2\mu_{\mathbf{x}}\mu_{\mathbf{y}}+c_{1}}{\mu_{\mathbf{x}}^{2}+\mu_{\mathbf{y}}^{2}+c_{1}}\Big)\Big(\frac{2\sigma_{\mathbf{xy}}+c_{2}}{\sigma_{\mathbf{x}}^{2}+\sigma_{\mathbf{y}}^{2}+c_{2}}\Big) \tag{6}$$

Full text

Fast Underwater Image Enhancement for Improved Visual Perception

Md Jahidul Islam$^{1}$, Youya Xia$^{2}$ and Junaed Sattar$^{3}$

{$^{1}$islam034, $^{2}$xiaxx244, $^{3}$junaed}@umn.edu

Interactive Robotics and Vision Laboratory, Department of Computer Science and Engineering

Minnesota Robotics Institute, University of Minnesota, Twin Cities, MN, USA

Abstract

In this paper, we present a conditional generative adversarial network-based model for real-time underwater image enhancement. To supervise the adversarial training, we formulate an objective function that evaluates the perceptual image quality based on its global content, color, local texture, and style information. We also present EUVP, a large-scale dataset of a paired and an unpaired collection of underwater images (of ‘poor’ and ‘good’ quality) that are captured using seven different cameras over various visibility conditions during oceanic explorations and human-robot collaborative experiments. In addition, we perform several qualitative and quantitative evaluations which suggest that the proposed model can learn to enhance underwater image quality from both paired and unpaired training. More importantly, the enhanced images provide improved performances of standard models for underwater object detection, human pose estimation, and saliency prediction. These results validate that it is suitable for real-time preprocessing in the autonomy pipeline by visually-guided underwater robots. The model and associated training pipelines are available at https://github.com/xahidbuffon/funie-gan.

1 Introduction

Visually-guided AUVs (Autonomous Underwater Vehicles) and ROVs (Remotely Operated Vehicles) are widely used in important applications such as the monitoring of marine species migration and coral reefs [39], inspection of submarine cables and wreckage [5], underwater scene analysis, seabed mapping, human-robot collaboration [24], and more. One major operational challenge for these underwater robots is that despite using high-end cameras, visual sensing is often greatly affected by poor visibility, light refraction, absorption, and scattering [31, 45, 24]. These optical artifacts trigger non-linear distortions in the captured images, which severely affect the performance of vision-based tasks such as tracking, detection and classification, segmentation, and visual servoing. Fast and accurate image enhancement techniques can alleviate these problems by restoring the perceptual and statistical qualities [15, 45] of the distorted images in real-time.

Because light propagates differently underwater than in the atmosphere, a unique set of non-linear image distortions occurs, driven by a variety of factors. For instance, underwater images tend to have a dominating green or blue hue [15] because red wavelengths get absorbed in deep water (as light travels further). Such wavelength-dependent attenuation [2], scattering, and other optical properties of the waterbodies cause irregular non-linear distortions [18, 45] which result in low-contrast, often blurred, and color-degraded images. Some of these aspects can be modeled and well estimated by physics-based solutions, particularly for dehazing and color correction [7, 4]. However, information such as the scene depth and optical water-quality measures is not always available in many robotic applications. Besides, these models are often computationally too demanding for real-time deployments.

A practical alternative is to approximate the underlying solution by learning-based methods, which have demonstrated remarkable success in recent years. Several models based on deep Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) provide state-of-the-art performance [21, 10, 25, 47] in learning to enhance perceptual image quality from a large collection of paired or unpaired data. For underwater imagery, in particular, a number of GAN-based models [15, 43] and CNN-based residual models [29] report inspiring progress for automatic color enhancement, dehazing, and contrast adjustment. However, there is significant room for improvement, as learning perceptual enhancement for underwater imagery is a more challenging ill-posed problem than it is for terrestrial imagery. Additionally, due to the high costs and difficulties associated with acquiring large-scale underwater data, most learning-based models use small-scale and often only synthetically generated images that fail to capture a wide range of natural variability. Moreover, the design of robust yet efficient image enhancement models, and their applicability for improving real-time underwater visual perception, have not been explored in depth in the literature.

We attempt to address these challenges by designing a fast underwater image enhancement model and analyzing its feasibility for real-time applications. We formulate the problem as an image-to-image translation problem by assuming there exists a non-linear mapping between the distorted (input) and enhanced (output) images. Then, we design a conditional GAN-based model to learn this mapping by adversarial training on a large-scale dataset named EUVP (Enhancement of Underwater Visual Perception). From the perspective of its design, implementation, and experimental validation, we make the following contributions in this paper:

(a) We present a fully-convolutional conditional GAN-based model for real-time underwater image enhancement, which we refer to as FUnIE-GAN. We formulate a multi-modal objective function to train the model by evaluating the perceptual quality of an image based on its global content, color, local texture, and style information.

(b) Additionally, we present the EUVP dataset, a paired and an unpaired collection of $20$K underwater images (of poor and good quality) that can be used for one-way and two-way adversarial training [10, 47]. The dataset is available at http://irvlab.cs.umn.edu/resources/euvp-dataset.

(c) Furthermore, we present qualitative and quantitative performance evaluations compared to state-of-the-art models. The results suggest that FUnIE-GAN can learn to enhance perceptual image quality from both paired and unpaired training. More importantly, the enhanced images significantly boost the performance of several underwater visual perception tasks such as object detection, human pose estimation, and saliency prediction; a few sample demonstrations are highlighted in Fig. 1.

In addition to presenting the conceptual model of FUnIE-GAN, we analyze important design choices and relevant practicalities for its efficient implementation. We also conduct a user study and a thorough feasibility analysis to validate its effectiveness for improving the real-time perception performance of visually-guided underwater robots.

2 Related Work

2.1 Automatic Image Enhancement

Automatic image enhancement is a well-studied problem in the domains of computer vision, robotics, and signal processing. Classical approaches use hand-crafted filters to enforce local color constancy and improve contrast/lightness rendition [37]. Additionally, prior knowledge or statistical assumptions about a scene (e.g., haze-lines, dark channel prior [4], etc.) are often utilized for global enhancements such as image deblurring, dehazing [19], etc. Over the last decade, single image enhancement has made remarkable progress due to the advent of deep learning and the availability of large-scale datasets. The contemporary deep CNN-based models provide state-of-the-art performance for problems such as image colorization [44], color/contrast adjustment [11], dehazing [8], etc. These models learn a sequence of non-linear filters from paired training data, which provide much better performance compared to using hand-crafted filters.

Moreover, the GAN-based models [16] have shown great success for style-transfer and image-to-image translation problems [25]. They employ a two-player min-max game where the ‘generator’ tries to fool the ‘discriminator’ by generating fake images that appear to be sampled from the real distribution. Simultaneously, the discriminator tries to get better at discarding fake images and eventually (in equilibrium) the generator learns to model the underlying distribution. Although such adversarial training can be unstable, several tricks and choices of loss functions have been proposed in the literature to mitigate that. For instance, Wasserstein GAN [3] improves the training stability by using the earth-mover distance to measure the distance between the data distribution and the model distribution. Energy-based GANs [46] also improve training stability by modeling the discriminator as an energy function, whereas the Least-Squares GAN [33] addresses the vanishing gradients problem by adopting a least-squares loss function for the discriminator. On the other hand, conditional GANs [34] allow constraining the generator to produce samples that follow a pattern or belong to a specific class, which is particularly useful to learn a pixel-to-pixel (Pix2Pix) mapping [25] between an arbitrary input domain (e.g., distorted images) and the desired output domain (e.g., enhanced images).

A major limitation of the above-mentioned models is that they require paired training data, which may not be available or can be difficult to acquire for many practical applications. The two-way GANs (e.g., CycleGAN [47], DualGAN [42], etc.) solve this problem by using a ‘cycle-consistency loss’ that allows learning the mutual mappings between two domains from unpaired data. Such models have been effectively used for unpaired learning of perceptual image enhancement [10] as well. Furthermore, Ignatov et al. [21] showed that additional loss-terms for preserving the high-level feature-based content improve the quality of image enhancement using GANs.

2.2 Improving Underwater Visual Perception

Traditional physics-based methods use the atmospheric dehazing model to estimate the transmission and ambient light in a scene to recover true pixel intensities [12, 7]. Another class of methods designs a series of bilateral and trilateral filters to reduce noise and improve global contrast [31, 45]. In recent work, Akkaynak et al. [2] proposed a revised imaging model that accounts for the unique distortions pertaining to underwater light propagation; this contributes to a more accurate color reconstruction and, overall, a better approximation to the ill-posed underwater image enhancement problem. Nevertheless, these methods require scene depth (or multiple images) and optical waterbody measurements as priors.

On the other hand, several single image enhancement models based on deep adversarial [15, 43, 28] and residual learning [29] have reported inspiring results of late. However, they typically use only synthetically distorted images for paired training, which often limits their generalization performance. The extent of large-scale unpaired training on naturally distorted underwater images has not been explored in the literature. Moreover, most existing models fail to ensure fast inference on single-board robotic platforms, which limits their applicability for improving real-time visual perception. We attempt to address these aspects in this paper.

3 Proposed Model and Dataset

3.1 FUnIE-GAN Architecture

Given a source domain $X$ (of distorted images) and desired domain $Y$ (of enhanced images), our goal is to learn a mapping $G:X\rightarrow Y$ in order to perform automatic image enhancement. We adopt a conditional GAN-based model where the generator tries to learn this mapping by evolving with an adversarial discriminator through an iterative min-max game. As illustrated in Fig. 2, we design a generator network by following the principles of U-Net [38]. It is an encoder-decoder network ($e_{1}$-$e_{5}$, $d_{1}$-$d_{5}$) with connections between the mirrored layers, i.e., between ($e_{1}$, $d_{5}$), ($e_{2}$, $d_{4}$), ($e_{3}$, $d_{3}$), and ($e_{4}$, $d_{2}$). Specifically, the output of each encoder is concatenated to the respective mirrored decoder. This idea of skip-connections in the generator network is shown to be very effective [25, 10, 15] for image-to-image translation and image quality enhancement problems. In FUnIE-GAN, however, we employ a much simpler model with fewer parameters in order to achieve fast inference. The input to the network is set to $256\times 256\times 3$ and the encoder ($e_{1}$-$e_{5}$) learns only $256$ feature-maps of size $8\times 8$. The decoder ($d_{1}$-$d_{5}$) utilizes these feature-maps and inputs from the skip-connections to learn to generate a $256\times 256\times 3$ (enhanced) image as output. The network is fully-convolutional as no fully-connected layers are used. Additionally, 2D convolutions with $4\times 4$ filters are applied at each layer, each followed by a Leaky-ReLU non-linearity [32] and Batch Normalization (BN) [22]. The feature-map sizes in each layer and other model parameters are annotated in Fig. 2(a).

For the discriminator, we employ a Markovian PatchGAN [25] architecture that assumes the independence of pixels beyond the patch-size, i.e., only discriminates based on the patch-level information. This assumption is important to effectively capture high-frequency features such as local texture and style [42]. In addition, this configuration is computationally more efficient as it requires fewer parameters compared to discriminating globally at the image level. As shown in Fig. 2(b), four convolutional layers are used to transform a $256\times 256\times 6$ input (real and generated image) to a $16\times 16\times 1$ output that represents the averaged validity responses of the discriminator. At each layer, $3\times 3$ convolutional filters are used with a stride of $2$ ; then the non-linearity and BN are applied the same way as the generator.
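The feature-map sizes quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming stride-2 convolutions in the encoder as well as in the discriminator and TensorFlow-style `same` padding; neither assumption is stated explicitly in the text, and the helper names are hypothetical.

```python
from typing import List


def conv_out_size(size: int, kernel: int, stride: int, padding: str = "same") -> int:
    """Spatial output size of a 2D convolution along one axis."""
    if padding == "same":
        # 'same' padding (assumed here): output = ceil(size / stride)
        return -(-size // stride)
    # 'valid' padding: no zero-padding at the borders
    return (size - kernel) // stride + 1


def downsample_chain(size: int, kernel: int, n_layers: int, stride: int = 2) -> List[int]:
    """Sizes after each of n_layers strided convolutions, starting from `size`."""
    sizes = [size]
    for _ in range(n_layers):
        sizes.append(conv_out_size(sizes[-1], kernel, stride))
    return sizes


# Generator encoder e1-e5: 4x4 filters, five layers (stride 2 is an assumption)
generator_encoder = downsample_chain(256, kernel=4, n_layers=5)
# Discriminator: 3x3 filters, four layers, stride 2 (as stated in the text)
discriminator = downsample_chain(256, kernel=3, n_layers=4)

print(generator_encoder)  # [256, 128, 64, 32, 16, 8]
print(discriminator)      # [256, 128, 64, 32, 16]
```

Under these assumptions, five halvings take the $256\times 256$ input to the $8\times 8$ bottleneck, and four halvings give the $16\times 16$ PatchGAN output.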

3.2 Objective Function Formulation

A standard conditional GAN-based model learns a mapping $G:\{X,Z\}\rightarrow Y$ , where $X$ ( $Y$ ) represents the source (desired) domain, and $Z$ denotes random noise. The conditional adversarial loss function [34] is expressed as:

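$$\mathcal{L}_{cGAN}(G,D)=\mathbb{E}_{X,Y}\big[\log D(Y)\big]+\mathbb{E}_{X,Y}\big[\log(1-D(X,G(X,Z)))\big] \tag{1}$$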

Here, the generator $G$ tries to minimize $\mathcal{L}_{cGAN}$ while the discriminator $D$ tries to maximize it. In FUnIE-GAN, we associate three additional aspects, i.e., global similarity, image content, and local texture and style information in the objective to quantify perceptual image quality.

•

Global similarity: existing methods have shown that adding an $L_{1}$ ( $L_{2}$ ) loss to the objective function enables $G$ to learn to sample from a globally similar space in an $L_{1}$ ( $L_{2}$ ) sense [25, 43]. Since the $L_{1}$ loss is less prone to introduce blurring, we add the following loss term in the objective:

$$\mathcal{L}_{1}(G)=\mathbb{E}_{X,Y,Z}\big[\lVert Y-G(X,Z)\rVert_{1}\big]$$

•

Image content: we add a content loss term in the objective in order to encourage $G$ to generate an enhanced image that has similar content (i.e., feature representation) to the target (i.e., real) image. Inspired by [26, 21], we define the image content function $\Phi(\cdot)$ as the high-level features extracted by the block5_conv2 layer of a pre-trained VGG-19 network. Then, we formulate the content loss as follows:

$$\mathcal{L}_{con}(G)=\mathbb{E}_{X,Y,Z}\big[\lVert \Phi(Y)-\Phi(G(X,Z))\rVert_{2}\big]$$

•

Local texture and style: as mentioned, Markovian PatchGANs are effective in capturing high-frequency information pertaining to the local texture and style [25]. Hence, we rely on $D$ to enforce the local texture and style consistency in adversarial fashion.

3.2.1 Paired Training

For paired training, we formulate an objective function that guides $G$ to learn to improve the perceptual image quality so that the generated image is close to the respective ground truth in terms of its global appearance and high-level feature representation. On the other hand, $D$ will discard a generated image that has locally inconsistent texture and style. Specifically, we use the following objective function for paired training:

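$$G^{*}=\arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G,D)+\lambda_{1}\mathcal{L}_{1}(G)+\lambda_{c}\mathcal{L}_{con}(G) \tag{2}$$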

Here, $\lambda_{1}=0.7$ and $\lambda_{c}=0.3$ are scaling factors that we empirically tuned as hyper-parameters.
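As an illustration of how the three terms combine on the generator side, here is a minimal NumPy sketch. It is not the authors' TensorFlow implementation: `toy_features` is a hypothetical stand-in for the VGG-19 feature extractor $\Phi$, the $L_{1}$ term is normalized by the pixel count (an assumption), and the noise input $Z$ is omitted.

```python
import numpy as np

LAMBDA_1 = 0.7  # weight of the global similarity (L1) term, from the text
LAMBDA_C = 0.3  # weight of the content term, from the text


def toy_features(img: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for Phi: 8x8 average pooling (NOT VGG-19)."""
    b, h, w, c = img.shape
    return img.reshape(b, h // 8, 8, w // 8, 8, c).mean(axis=(2, 4))


def l1_loss(y: np.ndarray, g: np.ndarray) -> float:
    # Mean absolute difference; dividing by the pixel count is an assumption
    return float(np.mean(np.abs(y - g)))


def content_loss(y: np.ndarray, g: np.ndarray, phi=toy_features) -> float:
    # Per-image L2 norm of the feature difference, averaged over the batch
    diff = (phi(y) - phi(g)).reshape(y.shape[0], -1)
    return float(np.mean(np.linalg.norm(diff, axis=1)))


def adversarial_loss_g(d_fake: np.ndarray, eps: float = 1e-7) -> float:
    # Generator-side part of L_cGAN: E[log(1 - D(X, G(X)))]
    return float(np.mean(np.log(1.0 - np.clip(d_fake, eps, 1.0 - eps))))


def generator_objective(y: np.ndarray, g: np.ndarray, d_fake: np.ndarray) -> float:
    """Quantity the generator minimizes in the paired setting."""
    return (adversarial_loss_g(d_fake)
            + LAMBDA_1 * l1_loss(y, g)
            + LAMBDA_C * content_loss(y, g))


rng = np.random.default_rng(0)
y = rng.random((2, 16, 16, 3))       # toy "ground truth" batch
g = y.copy()                         # a perfect generator output
d_fake = np.full((2, 4, 4, 1), 0.5)  # patch-level discriminator responses
loss = generator_objective(y, g, d_fake)
```

With a perfect reconstruction, the $L_{1}$ and content terms vanish and only the adversarial term remains, which is why the discriminator is still needed to enforce local texture and style.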

3.2.2 Unpaired Training

For unpaired training, we do not enforce the global similarity and content loss constraints as the pairwise ground truth is not available. Instead, the objective is to learn both the forward mapping $G_{F}:\{X,Z\}\rightarrow Y$ and the reconstruction $G_{R}:\{Y,Z\}\rightarrow X$ simultaneously by maintaining cycle-consistency. As suggested by Zhu et al. [47], we formulate the cycle-consistency loss as follows:

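$$\mathcal{L}_{cyc}(G_{F},G_{R})=\mathbb{E}_{X,Y,Z}\big[\lVert X-G_{R}(G_{F}(X,Z))\rVert_{1}\big]+\mathbb{E}_{X,Y,Z}\big[\lVert Y-G_{F}(G_{R}(Y,Z))\rVert_{1}\big] \tag{3}$$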

Therefore, our objective for the unpaired training is:

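$$G_{F}^{*},G_{R}^{*}=\arg\min_{G_{F},G_{R}}\max_{D_{Y},D_{X}}\ \mathcal{L}_{cGAN}(G_{F},D_{Y})+\mathcal{L}_{cGAN}(G_{R},D_{X})+\lambda_{cyc}\mathcal{L}_{cyc}(G_{F},G_{R}) \tag{4}$$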

Here, $D_{Y}$ ($D_{X}$) is the discriminator associated with the generator $G_{F}$ ($G_{R}$), and the scaling factor $\lambda_{cyc}=0.1$ is an empirically tuned hyper-parameter. We do not enforce an additional global similarity loss term because $\mathcal{L}_{cyc}$ computes an analogous reconstruction loss for each domain in $L_{1}$ space.
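The cycle-consistency idea can be illustrated with a toy example. This is a minimal sketch in which the generators are hypothetical scalar functions rather than networks, the noise input $Z$ is omitted, and the $L_{1}$ norm is normalized by the number of elements (an assumption).

```python
import numpy as np

LAMBDA_CYC = 0.1  # scaling factor from the text


def cycle_consistency_loss(x, y, g_f, g_r) -> float:
    """L1 reconstruction error after a round trip through both generators."""
    forward_cycle = np.mean(np.abs(x - g_r(g_f(x))))   # X -> Y -> X
    backward_cycle = np.mean(np.abs(y - g_f(g_r(y))))  # Y -> X -> Y
    return float(forward_cycle + backward_cycle)


# Toy generators (hypothetical): g_r is the exact inverse of g_f
def g_f(x):
    return 2.0 * x + 0.1


def g_r(y):
    return (y - 0.1) / 2.0


# A reconstruction generator that ignores the offset, so cycles do not close
def g_r_bad(y):
    return y / 2.0


rng = np.random.default_rng(0)
x = rng.random((4, 8, 8, 3))  # toy batch from the 'poor quality' domain
y = rng.random((4, 8, 8, 3))  # toy batch from the 'good quality' domain

consistent = cycle_consistency_loss(x, y, g_f, g_r)
inconsistent = cycle_consistency_loss(x, y, g_f, g_r_bad)
weighted = LAMBDA_CYC * inconsistent  # contribution to the unpaired objective
```

When the two mappings invert each other the loss is zero up to floating-point error; the mismatched pair is penalized in both directions, which is what ties the forward and reconstruction generators together in the absence of paired ground truth.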

3.3 EUVP Dataset

The EUVP dataset contains a large collection of paired and unpaired underwater images of poor and good perceptual quality. We used seven different cameras, which include multiple GoPros [17], Aqua AUV’s uEye cameras [14], low-light USB cameras [6], and Trident ROV’s HD camera [35], to capture images for the dataset. The data was collected during oceanic explorations and human-robot cooperative experiments in different locations under various visibility conditions. Additionally, images extracted from a few publicly available YouTube™ videos are included in the dataset. The images are carefully selected to accommodate a wide range of natural variability (e.g., scenes, waterbody types, lighting conditions, etc.) in the data.

The unpaired data is prepared by separating good and poor quality images based on visual inspection by six human participants. They inspected several image properties (e.g., color, contrast, and sharpness) and considered whether the scene is visually interpretable, i.e., whether foreground objects are identifiable. Hence, the unpaired training supports the modeling of human perceptual preferences of underwater image quality. On the other hand, the paired data is prepared by following a procedure suggested in [15]. Specifically, a CycleGAN [47]-based model is trained on our unpaired data to learn the domain transformation between the good and poor quality images. Subsequently, the good quality images are distorted by the learned model to generate respective pairs; we also augment a set of underwater images from the ImageNet dataset [13] and from Flickr™.

There are over $12$ K paired and $8$ K unpaired instances in the EUVP dataset; a few samples are provided in Fig. 3. It is to be noted that our focus is to facilitate perceptual image enhancement for boosting robotic scene understanding, not to model the underwater optical degradation process for image restoration, which requires scene depth and waterbody properties.

4 Experimental Results

We use TensorFlow libraries [1] to implement the FUnIE-GAN model. It is trained separately on $11$K paired and $7.5$K unpaired instances; the rest are used for respective validation and testing. Four NVIDIA™ GeForce GTX 1080 graphics cards are used for training; both models are trained for $60$K-$70$K iterations with a batch-size of $8$. We now present the experimental evaluations based on a qualitative analysis, standard quantitative metrics, and a user study.

4.1 Qualitative Evaluations

We first qualitatively analyze the enhanced color and sharpness of the FUnIE-GAN-generated images compared to their respective ground truths. As Fig. 4(a) shows, the true color and sharpness are mostly recovered in the enhanced images. Additionally, as shown in Fig. 4(b), the greenish hue in underwater images is rectified and the global contrast is enhanced. These are the primary characteristics of an effective underwater image enhancer. We further demonstrate the contribution of each loss term of FUnIE-GAN, i.e., the global similarity loss ($\mathcal{L}_{1}$) and the image content loss ($\mathcal{L}_{con}$), for learning the enhancement. We observe that the $\mathcal{L}_{1}$ term helps to generate sharper images, while the $\mathcal{L}_{con}$ term contributes to furnishing finer texture details (see Fig. 4(c)). Moreover, we found slightly better numeric stability for $\mathcal{L}_{con}$ with the block5_conv2 layer of VGG-19 compared to its last feature extraction layer (block5_conv4).

Next, we conduct a qualitative comparison of perceptual image enhancement by FUnIE-GAN with several state-of-the-art models. We consider five learning-based models: (i) underwater GAN with gradient penalty (UGAN-P [15]), (ii) Pix2Pix [25], (iii) least-squares GAN (LS-GAN [33]), (iv) GAN with residual blocks [27] in the generator (Res-GAN), and (v) Wasserstein GAN [3] with residual blocks in the generator (Res-WGAN). These models are implemented with $8$ encoder-decoder pairs (or $16$ residual blocks) in the generator network and $5$ convolutional layers in the discriminator. They are trained on the paired EUVP dataset using the same setup as FUnIE-GAN. Additionally, we consider CycleGAN [47] as a baseline for comparing the performance of FUnIE-GAN with unpaired training (i.e., FUnIE-GAN-UP). We also include two physics-based models in the comparison: multi-band fusion-based enhancement (Mband-EN [12]) and haze-line-aware color restoration (Uw-HL [4]). A common test set with $1$K images (of $256\times 256$ resolution) is used for the qualitative evaluation; it also includes $72$ images with known waterbody types [4]. A few sample comparisons are illustrated in Fig. 5.

As demonstrated in Fig. 5, Res-GAN, Res-WGAN, and Mband-EN often suffer from over-saturation, while LS-GAN generally fails to rectify the greenish hue in images. UGAN-P, Pix2Pix, and Uw-HL perform reasonably well and their enhanced images are comparable to those of FUnIE-GAN; however, UGAN-P often over-saturates bright objects in the scene while Pix2Pix fails to enhance global brightness in some cases. On the other hand, we observe that achieving color consistency and hue rectification is relatively more challenging through unpaired learning. This is mostly because of the lack of reference color or texture information in the loss function. Nevertheless, FUnIE-GAN-UP still outperforms CycleGAN in general. Overall, FUnIE-GAN performs as well as, and often better than, the other models, without using scene depth or prior waterbody information as the physics-based models do, and despite having a much simpler network architecture than the existing learning-based models.

4.2 Quantitative Evaluation

We consider two standard metrics [21, 10, 20] named Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) in order to quantitatively compare FUnIE-GAN-enhanced images with their respective ground truths. The PSNR approximates the reconstruction quality of a generated image $\mathbf{x}$ compared to its ground truth $\mathbf{y}$ based on their Mean Squared Error (MSE) as follows:

$$PSNR(\mathbf{x},\mathbf{y})=10\log_{10}\big[255^{2}/MSE(\mathbf{x},\mathbf{y})\big] \tag{5}$$

On the other hand, the SSIM [41] compares the image patches based on three properties: luminance, contrast, and structure. It is defined as follows:

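$$SSIM(\mathbf{x},\mathbf{y})=\Big(\frac{2\mu_{\mathbf{x}}\mu_{\mathbf{y}}+c_{1}}{\mu_{\mathbf{x}}^{2}+\mu_{\mathbf{y}}^{2}+c_{1}}\Big)\Big(\frac{2\sigma_{\mathbf{xy}}+c_{2}}{\sigma_{\mathbf{x}}^{2}+\sigma_{\mathbf{y}}^{2}+c_{2}}\Big) \tag{6}$$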

In Eq. 6, $\mathbf{\mu}_{\mathbf{x}}$ ( $\mathbf{\mu}_{\mathbf{y}}$ ) denotes the mean, and $\mathbf{\sigma}_{\mathbf{x}}^{2}$ ( $\mathbf{\sigma}_{\mathbf{y}}^{2}$ ) is the variance of $\mathbf{x}$ ( $\mathbf{y}$ ); whereas $\mathbf{\sigma}_{\mathbf{xy}}$ denotes the cross-correlation between $\mathbf{x}$ and $\mathbf{y}$ . Additionally, $c_{1}=(255\times 0.01)^{2}$ and $c_{2}=(255\times 0.03)^{2}$ are constants that ensure numeric stability.
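Both metrics are straightforward to compute. The following NumPy sketch assumes 8-bit images and the standard PSNR definition $10\log_{10}(255^{2}/MSE)$. For brevity, the SSIM statistics are taken over the whole image as a single window, whereas SSIM is normally computed over local patches and averaged, so this is a simplification rather than the evaluation code behind Table 1.

```python
import numpy as np

C1 = (255 * 0.01) ** 2  # stabilizing constants, as given in the text
C2 = (255 * 0.03) ** 2


def psnr(x: np.ndarray, y: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(255.0 ** 2 / mse))


def ssim_global(x: np.ndarray, y: np.ndarray) -> float:
    """SSIM with statistics over the whole image (single-window simplification)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    luminance = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    contrast_structure = (2 * cov_xy + C2) / (var_x + var_y + C2)
    return float(luminance * contrast_structure)


rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3)).astype(np.uint8)
black = np.zeros((32, 32, 3), dtype=np.uint8)
white = np.full((32, 32, 3), 255, dtype=np.uint8)

print(psnr(img, img))         # inf
print(psnr(black, white))     # 0.0 (MSE equals 255^2)
print(ssim_global(img, img))  # 1.0
```

Casting to `float64` before subtracting matters here: subtracting `uint8` arrays directly would wrap around instead of producing negative differences.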

In Table 1, we provide the averaged PSNR and SSIM values over $1$K test images for FUnIE-GAN and compare the results with the same models used in the qualitative evaluation. The results indicate that FUnIE-GAN performs best on both PSNR and SSIM metrics. We conduct a similar analysis for the Underwater Image Quality Measure (UIQM) [36, 30], which quantifies underwater image colorfulness, sharpness, and contrast. We present the results in Table 2, which indicates that although FUnIE-GAN-UP performs better than CycleGAN, its UIQM values on the paired dataset are relatively poor. Interestingly, the models trained on paired data, particularly FUnIE-GAN, UGAN-P, and Pix2Pix, produce better results. We postulate that the global similarity loss in FUnIE-GAN and Pix2Pix, or the gradient-penalty term in UGAN-P, contributes to this enhancement, as they all add $L_{1}$ terms in the adversarial objective. Our ablation experiments of FUnIE-GAN (see Fig. 4(c)) reveal that the $\mathcal{L}_{1}$ loss term contributes a $4.58\%$ improvement in UIQM, while $\mathcal{L}_{con}$ contributes $1.07\%$. Moreover, without both the $\mathcal{L}_{1}$ and $\mathcal{L}_{con}$ loss terms, the average UIQM values drop by $17.6\%$; we observe similar statistics for PSNR and SSIM as well.

4.3 User Study

We also conduct a user study to add human preferences to our qualitative performance analysis. The participants are shown different sets of $9$ images (one for each learning-based model) and asked to rank the $3$ best-quality images. A total of $78$ individuals participated in the study and a total of $312$ responses are recorded. Table 3 compares the average rank-1, rank-2, and rank-3 accuracy of the top $4$ categories. The average rank-3 accuracy of the original images is recorded to be $6.67\%$, which suggests that the users clearly preferred enhanced images over the original ones. Moreover, the results indicate that the users prefer the images enhanced by FUnIE-GAN, UGAN-P, and Pix2Pix compared to the other models; these statistics are consistent with our qualitative and quantitative analysis.
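The text does not spell out how rank-$n$ accuracy is computed. One plausible reading, assumed here, is the percentage of responses in which a model's image appears among the first $n$ ranked choices. The sketch below uses made-up responses and hypothetical model labels purely to show the bookkeeping; it does not reproduce the study's data.

```python
from typing import Sequence


def rank_n_accuracy(responses: Sequence[Sequence[str]], model: str, n: int) -> float:
    """Percentage of responses that place `model` within the top n ranks."""
    hits = sum(1 for ranking in responses if model in ranking[:n])
    return 100.0 * hits / len(responses)


# Hypothetical responses: each is a participant's top-3 ranking, best first
responses = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["C", "A", "B"],
    ["A", "C", "D"],
]

for n in (1, 2, 3):
    print(n, rank_n_accuracy(responses, "A", n))
```

Under this definition, rank-$n$ accuracy is non-decreasing in $n$ for every model, which matches the pattern in each row of Table 3.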

4.4 Improved Visual Perception

As demonstrated in Fig. 6(a), we conduct further experiments to quantitatively interpret the effectiveness of FUnIE-GAN-enhanced images for underwater visual perception over a variety of test cases. We analyze the performance of standard deep visual models for underwater object detection [23], 2D human body-pose estimation [9], and visual attention-based saliency prediction [40]; although results vary depending on the image qualities of a particular test set, on average, we observe $11$-$14\%$, $22$-$28\%$, and $26$-$28\%$ improvements, respectively. We also evaluate other state-of-the-art models on the same test sets; as Fig. 6(b) suggests, images enhanced by UGAN-P, Res-GAN, Res-WGAN, Uw-HL, and Pix2Pix also achieve considerable performance improvements. However, these models offer significantly slower inference rates than FUnIE-GAN, and most of them are not suitable for real-time deployment on robotic platforms.

FUnIE-GAN’s memory requirement is $17$ MB, and it operates at a rate of $25.4$ FPS (frames per second) on a single-board computer (NVIDIA™ Jetson TX2), $148.5$ FPS on a graphics card (NVIDIA™ GTX 1080), and $7.9$ FPS on a robot CPU (Intel™ Core-i3 6100U). These computational characteristics make it well suited for use as an image processing pipeline by visually-guided underwater robots in real-time applications.
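To make the frame-rate figures concrete, the sketch below shows one way throughput could be measured for any per-frame enhancement callable, together with the conversion from FPS to a per-frame time budget. It is a generic timing harness under stated assumptions, not the authors' benchmarking procedure; `measure_fps`, `frame_budget_ms`, and the `warmup` parameter are hypothetical names.

```python
import time


def measure_fps(enhance, frames, warmup=2):
    """Average frames per second of `enhance` over `frames`.

    The first `warmup` frames are processed but not timed, since initial
    calls often include one-off setup costs (an assumption, not a claim
    about the paper's protocol).
    """
    frames = list(frames)
    if len(frames) <= warmup:
        raise ValueError("need more frames than warm-up iterations")
    for frame in frames[:warmup]:
        enhance(frame)
    timed = frames[warmup:]
    start = time.perf_counter()
    for frame in timed:
        enhance(frame)
    elapsed = time.perf_counter() - start
    return float("inf") if elapsed == 0 else len(timed) / elapsed


def frame_budget_ms(fps):
    """Average processing time per frame, in milliseconds, for a given FPS."""
    return 1000.0 / fps
```

With the rates quoted above, `frame_budget_ms` gives roughly 39.4 ms per frame on the Jetson TX2, 6.7 ms on the GTX 1080, and 126.6 ms on the Core-i3 6100U.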

4.5 Limitations and Failure Cases

We observe a couple of challenging cases for FUnIE-GAN, which are depicted by a few examples in Fig. 7. First, FUnIE-GAN is not very effective for enhancing severely degraded and texture-less images. The generated images in such cases are often over-saturated by noise amplification. Although the hue rectification is generally correct, the color and texture recovery remains poor. Secondly, FUnIE-GAN-UP is prone to training instability. Our investigations suggest that the discriminator often becomes too good too early, causing a diminishing gradient effect that halts the generator’s learning. As shown in Fig. 7 (right), the generated images in such cases lack color consistency and accurate texture details. This is a fairly common issue in unpaired training of GANs [10, 21, 26], and requires meticulous hyper-parameter tuning.
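The diminishing-gradient effect mentioned above can be illustrated numerically. The sketch assumes the original saturating minimax generator loss, $\log(1 - D(G(x)))$ with $D = \sigma(a)$ for a discriminator logit $a$; this is a textbook setting and may differ from the exact adversarial objective used to train FUnIE-GAN-UP. For this loss, the derivative with respect to the logit is $-\sigma(a)$, which vanishes as the discriminator grows confident that a generated image is fake.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def saturating_generator_grad(logit):
    """d/d(logit) of log(1 - sigmoid(logit)), which simplifies to -sigmoid(logit).

    Assumption: standard minimax GAN loss, used here for illustration only.
    """
    return -sigmoid(logit)


# More negative logits mean the discriminator is more certain the sample is fake.
for logit in (0.0, -4.0, -8.0):
    d_fake = sigmoid(logit)
    grad = saturating_generator_grad(logit)
    print(f"D(G(x)) = {d_fake:.5f}, gradient w.r.t. logit = {grad:.5f}")
```

The gradient magnitude shrinks in step with $D(G(x))$, so a discriminator that separates real and generated images too early leaves the generator with very little signal to learn from.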

FUnIE-GAN balances a trade-off between robustness and efficiency, which limits its performance to a certain degree. More powerful deep models (i.e., denser architectures with more parameters) can be adopted for non-real-time applications; moreover, the input/output layers can be modified with additional bottleneck layers for learning enhancement at resolutions higher than $256\times 256$. On the other hand, FUnIE-GAN does not guarantee the recovery of true pixel intensities, as it is designed for perceptual image quality enhancement. If scene depth and optical waterbody properties are available, underwater light propagation and image formation characteristics [2, 4, 7] can be incorporated into the optimization for more accurate image restoration.

5 Conclusion

We present a simple yet efficient conditional GAN-based model for underwater image enhancement. The proposed model formulates a perceptual loss function by evaluating image quality based on its global color, content, local texture, and style information. We also present a large-scale dataset containing a paired and an unpaired collection of underwater images for supervised training. We perform extensive qualitative and quantitative evaluations and conduct a user study, which show that the proposed model performs as well as, and often better than, the state-of-the-art models, while also providing much faster inference. Moreover, we demonstrate its effectiveness in improving underwater object detection, saliency prediction, and human body-pose estimation performance. In the future, we plan to investigate its feasibility in other underwater human-robot cooperative applications, marine trash identification, etc. We also seek to improve its color consistency and stability for unpaired training.

Bibliography

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, et al. TensorFlow: A System for Large-scale Machine Learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.
[2] D. Akkaynak and T. Treibitz. A Revised Underwater Image Formation Model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6723–6732, 2018.
[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (ICML), pages 214–223, 2017.
[4] D. Berman, D. Levy, S. Avidan, and T. Treibitz. Underwater Single Image Color Restoration using Haze-Lines and a New Quantitative Dataset. arXiv preprint arXiv:1811.01343, 2018.
[5] B. Bingham, B. Foley, H. Singh, R. Camilli, K. Delaporta, R. Eustice, et al. Robotic Tools for Deep Water Archaeology: Surveying an Ancient Shipwreck with an Autonomous Underwater Vehicle. Journal of Field Robotics (JFR), 27(6):702–717, 2010.
[6] Blue Robotics. Low-light HD USB Camera. https://www.bluerobotics.com/, 2016. Accessed: 3-15-2019.
[7] M. Bryson, M. Johnson-Roberson, O. Pizarro, and S. B. Williams. True Color Correction of Autonomous Underwater Vehicle Imagery. Journal of Field Robotics (JFR), 33(6):853–874, 2016.
[8] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao. DehazeNet: An End-to-end System for Single Image Haze Removal. IEEE Transactions on Image Processing, 25(11):5187–5198, 2016.