RSFDM-Net: Real-time Spatial and Frequency Domains Modulation Network for Underwater Image Enhancement

Jingxia Jiang; Jinbin Bai; Yun Liu; Junjie Yin; Sixiang Chen; Tian Ye; Erkang Chen

arXiv:2302.12186·cs.CV·February 24, 2023

TL;DR

RSFDM-Net is a real-time underwater image enhancement network that uses spatial and frequency domain modulation, incorporating novel modules like AFGM, MCAM, and TFE for improved color correction and detail preservation.

Contribution

The paper introduces RSFDM-Net, a novel real-time network with unique modules for enhanced underwater image quality, outperforming existing methods in both visual and quantitative assessments.

Findings

01 Outperforms state-of-the-art methods in visual quality

02 Achieves significant improvements in quantitative metrics

03 Effectively models global background and local textures

Abstract

Underwater images typically experience mixed degradations of brightness and structure caused by the absorption and scattering of light by suspended particles. To address this issue, we propose a Real-time Spatial and Frequency Domains Modulation Network (RSFDM-Net) for the efficient enhancement of colors and details in underwater images. Specifically, our proposed conditional network is designed with Adaptive Fourier Gating Mechanism (AFGM) and Multiscale Convolutional Attention Module (MCAM) to generate vectors carrying low-frequency background information and high-frequency detail features, which effectively promote the network to model global background information and local texture details. To more precisely correct the color cast and low saturation of the image, we introduce a Three-branch Feature Extraction (TFE) block in the primary net that processes images pixel by pixel to…

Tables

Table 1: The test image size is 1080 × 720. The results show that our network can enhance 720P images in real time. In addition, given its trade-off between parameter count and computation, RSFDM-Net provides a solid foundation for practical use.

Method	PRWNet[28]	Shallow-UWnet[29]	UIEC^2-Net[14]	PUIE-Net[21]	Ours
#Param	6.30M	219.46K	534.96K	1.4M	110.95K
#GFLOPs	223.37G	304.75G	367.53G	423.05G	31.33G
#Runtime	0.216s	0.032s	0.174s	0.070s	0.039s

Table 2: Quantitative comparison of various approaches on the UIEB [26] and U45 [27] datasets, using PSNR, SSIM, UCIQE, UIQM, and NIQE to evaluate the performance of all methods. Bold and underline indicate the best and second-best metrics. ↑ means higher is better and ↓ means lower is better.

	T90[26]					C60[26]			U45[27]
Method	PSNR↑	SSIM↑	MSE↓	UCIQE↑	UIQM↑	UCIQE↑	UIQM↑	NIQE↓	UCIQE↑	UIQM↑	NIQE↓
(ICCVW’13)UDCP [9]	13.415	0.749	0.228	0.572	2.755	0.560	1.859	5.897	0.574	2.275	4.144
(TIP’17)IBLA [10]	18.054	0.808	0.142	0.582	2.557	0.584	1.662	5.954	0.565	2.387	4.100
(TIP’19)WaterNet [36]	16.305	0.797	0.161	0.564	2.916	0.550	2.113	5.778	0.576	2.957	5.346
(TB’20)SMBL [11]	16.681	0.801	0.158	0.589	2.598	0.571	1.643	5.891	0.571	2.387	4.412
(PR’20)UWCNN [13]	17.949	0.847	0.221	0.517	3.011	0.492	2.222	5.837	0.527	3.063	4.187
(ICCVW’20)PRW-Net [28]	20.787	0.823	0.099	0.603	3.062	0.572	2.717	5.425	0.625	3.026	4.157
(AAAI’21)Shallow-UWnet [29]	18.278	0.855	0.131	0.544	2.942	0.521	2.212	5.874	0.545	3.109	4.241
(TIP’21)Ucolor [37]	21.093	0.872	0.096	0.555	3.049	0.530	2.167	6.298	0.554	3.148	4.687
(SPIC’21)UIEC^2-Net [14]	22.958	0.907	0.078	0.599	2.999	0.580	2.280	5.438	0.604	3.125	4.182
(TIP’22)MLLE [35]	19.561	0.845	0.115	0.592	2.624	0.581	1.977	4.925	0.597	2.454	4.725
(ECCV’22)PUIE-Net(MC) [21]	21.382	0.882	0.093	0.566	3.021	0.543	2.155	5.935	0.563	3.192	4.146
Ours	23.325	0.912	0.076	0.616	2.678	0.582	1.940	5.842	0.627	2.995	4.179

Table 3: Ablation studies on the main modules of RSFDM-Net. IN and CA stand for Instance Normalization and Channel Attention module.

Setting	Model	PSNR↑	SSIM↑
$S1$	$S1$	22.374	0.874
ii	$S1$ + GCN	22.729	0.891
iii	$S1$ + IN	22.677	0.887
iv	$S1$ + GCN + AFGM	23.101	0.896
v	$S1$ + GCN + MCAM	23.065	0.903
vi	$S1$ + GCN + CA	22.959	0.886
vii	RSFDM-Net (Ours)	23.325	0.912

Equations

$$\mathbf{I}_{\mathbf{c}}(\mathbf{x}) = \mathbf{J}_{\mathbf{c}}(\mathbf{x})\, t_{\mathbf{c}}(\mathbf{x}) + \mathbf{B}_{\mathbf{c}}\,(1 - t_{\mathbf{c}}(\mathbf{x})), \quad c \in \{R, G, B\}$$

$$t_{\mathbf{c}}(\mathbf{x}) = e^{-\beta_{\mathbf{c}} d(x)}$$

$$\mathrm{GSM}(x_{i}) = \gamma * x_{i} + \beta, \quad 0 < i \leq N$$

$$\begin{array}{l} X_{g1i} = f_{PW}(f_{DWD}(f_{DW}(X_{ci}))), \quad i = 1, 2 \\ X_{g2i} = f_{DW}(X_{ci}), \quad i = 1, 2 \end{array}$$

$$\mathcal{L}_{rec} = \mathcal{L}_{c}(\mathcal{I}(X), \mathcal{J}_{gt})$$

$$\mathcal{L}_{c} = \frac{1}{N}\sum_{i=1}^{N}\sqrt{\|X^{i} - Y^{i}\|^{2} + \epsilon^{2}}$$

$$\mathcal{L}_{perceptual} = \sum_{j=1}^{2}\frac{1}{C_{j}H_{j}W_{j}}\|\phi_{j}(\mathcal{I}(X)) - \phi_{j}(Y)\|_{1}$$

$$\mathcal{L} = \lambda_{1}\mathcal{L}_{c} + \lambda_{2}\mathcal{L}_{perceptual}$$


Taxonomy

Topics: Image Enhancement Techniques · Image and Signal Denoising Methods · Advanced Image Processing Techniques

Full text

Abstract

Underwater images typically experience mixed degradations of brightness and structure caused by the absorption and scattering of light by suspended particles. To address this issue, we propose a Real-time Spatial and Frequency Domains Modulation Network (RSFDM-Net) for the efficient enhancement of colors and details in underwater images. Specifically, our proposed conditional network is designed with an Adaptive Fourier Gating Mechanism (AFGM) and a Multiscale Convolutional Attention Module (MCAM) to generate vectors carrying low-frequency background information and high-frequency detail features, which help the network model global background information and local texture details. To correct the color cast and low saturation of the image more precisely, we introduce a Three-branch Feature Extraction (TFE) block in the primary net that processes images pixel by pixel to integrate the color information extended by the same channel (R, G, or B). This block consists of three small branches, each of which has its own weights. Extensive experiments demonstrate that our network significantly outperforms state-of-the-art methods in both visual quality and quantitative metrics.

Index Terms—Spatial and Frequency Domains, Adaptive Fourier Gating, Multiscale Convolutional Attention, Underwater Image Enhancement

1 Introduction

Underwater imaging plays a significant role in underwater robotics [1], providing essential information for perceiving and understanding underwater environments. Recently, more and more works [2, 3, 4, 5, 6] have paid attention to real-world image restoration problems with challenging degradations. According to the Jaffe-McGlamery imaging model [7, 8], underwater imaging consists of a linear superposition of direct, backscattered, and forward-scattered components. In general, the effects of forward scattering are negligible, so the imaging model can be simplified as:

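$$\mathbf{I}_{\mathbf{c}}(\mathbf{x}) = \mathbf{J}_{\mathbf{c}}(\mathbf{x})\, t_{\mathbf{c}}(\mathbf{x}) + \mathbf{B}_{\mathbf{c}}\,(1 - t_{\mathbf{c}}(\mathbf{x})), \quad c \in \{R, G, B\}$$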

where $\mathbf{I}_{\mathbf{c}}(\mathbf{x})$ is the observed intensity in the color channel $c$ of the input image at the pixel $\mathbf{x}$ , $\mathbf{J}_{\mathbf{c}}(\mathbf{x})$ represents the restored image, $\mathbf{B_{\mathbf{c}}}$ represents the background light, and $t_{\mathbf{c}}(\mathbf{x})$ is the transmission map, where $c$ represents the red, green, and blue channels.

$$t_{\mathbf{c}}(\mathbf{x}) = e^{-\beta_{\mathbf{c}} d(x)}$$

where $d(x)$ is the distance from the camera to the radiant object, and $\beta_{\mathbf{c}}$ is the spectral volume attenuation coefficient. Original underwater images are often impacted by long-distance backscattering, selective absorption, and light scattering, resulting in low contrast, low brightness, significant chromatic aberration, blurred details, and uneven bright spots. Therefore, how to improve the visual quality of underwater images is a challenging task.
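The simplified imaging model and the transmission map above can be checked numerically on a single pixel. The sketch below is only illustrative: the attenuation coefficients, background light, and distance are hypothetical values chosen so that red attenuates fastest, and they do not come from the paper.

```python
import math

def transmission(beta_c, distance):
    """Transmission t_c(x) = exp(-beta_c * d(x))."""
    return math.exp(-beta_c * distance)

def degrade(clean, background, beta_c, distance):
    """Forward model: I_c = J_c * t_c + B_c * (1 - t_c)."""
    t = transmission(beta_c, distance)
    return clean * t + background * (1.0 - t)

def restore(observed, background, beta_c, distance):
    """Invert the forward model for J_c when t_c and B_c are known."""
    t = transmission(beta_c, distance)
    return (observed - background * (1.0 - t)) / t

# Hypothetical per-channel values (assumptions, not taken from the paper).
BETA = {"R": 0.60, "G": 0.10, "B": 0.08}        # attenuation per metre
BACKGROUND = {"R": 0.05, "G": 0.45, "B": 0.55}  # background light B_c
DISTANCE = 5.0                                  # camera-to-object distance in metres

clean_pixel = {"R": 0.8, "G": 0.5, "B": 0.4}
observed = {c: degrade(clean_pixel[c], BACKGROUND[c], BETA[c], DISTANCE) for c in "RGB"}
recovered = {c: restore(observed[c], BACKGROUND[c], BETA[c], DISTANCE) for c in "RGB"}
```

With these values the observed red intensity drops far below the blue one, which mirrors the color cast described in the text. Exact inversion is only possible here because the transmission and background light are known, which is not the case for real images.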

In the early years, several works [9, 10, 11, 12] were developed to address chromatic aberration and blurring in underwater images. However, these traditional methods lack flexibility in practice. For example, an inaccurate estimate of intermediate parameters or invalid assumptions about underwater optical properties can lead to unsatisfactory results. To overcome this limitation, Li et al. [13] simulate realistic underwater images according to different water types and underwater imaging physical models. Wang et al. [14] propose a dual color space-based convolutional neural network for underwater image enhancement. Jiang et al. [15] design a novel domain-adaptive framework based on transfer learning to convert aerial image deblurring to real-world underwater image enhancement. Although these methods have achieved varying degrees of success, none of them designs network modules that separately target color shift and low contrast (mainly at low frequencies) and the loss of texture details (mainly at high frequencies). Huo et al. [16] use wavelet-enhanced learning units to decompose hierarchical features into high-frequency and low-frequency components and subsequently enhance them with normalization and attention mechanisms. While this method is highly successful, its large number of network parameters (6.3M) and its computational cost (0.22s to enhance a 720P underwater image) make it unsuitable for existing underwater equipment.

According to research [17], the task of underwater image enhancement has remarkable similarities with style transfer: both tasks are essentially concerned with changing the style (such as color) of an image while preserving its thematic content. However, compared to style transfer, underwater image enhancement pays more attention to the description of texture details, which makes the task even more demanding.

In this work, drawing inspiration from style transfer techniques, we propose the Real-time Spatial and Frequency Domains Modulation Network (RSFDM-Net) for efficient color and detail enhancement in underwater images. It comprises a strongly guided conditional net and Global Style Modulation (GSM), which modulates the style of different layers of the primary net. Because light of different wavelengths attenuates at different rates in water, with red light attenuating the fastest and blue-green light the slowest, the differences between the R, G, and B channels of an underwater image are significant [18]. Previous methods [19, 20] treat a single channel as the smallest unit of instance normalization and ignore the correlation between channels: all channels in the network are in fact obtained by expanding the three R, G, and B channels. In response, we design a TFE block in the primary net that processes images pixel by pixel. This three-branch feature extraction block captures the color features extended from the R, G, or B channel more effectively. Many deep learning-based frameworks [13, 14, 21] exploit various attention modules in the spatial domain to capture high-frequency texture information while ignoring low-frequency information, which is equally important under hybrid degradation. Motivated by this, we develop AFGM and MCAM on top of the Conditional Sequence Retouching Network (CSRNet) [22]. Unlike CSRNet [22], which focuses on global retouching, the proposed modules help the network model global background information and local texture details more efficiently and in finer detail.

Specifically, the conditional net uses AFGM to capture the global background information contained in the magnitude component and fuses it with the multi-scale image details obtained by MCAM, which forces the conditional net to attend to both low-frequency contrast problems and the loss of high-frequency detail. Finally, the output of the conditional net after average pooling is used as a conditional vector. This vector, which carries low-frequency background information and high-frequency detail features, is converted into learnable parameters by GSM, and these parameters are broadcast to different layers of the primary net to guide image enhancement. Extensive experiments demonstrate that RSFDM-Net achieves a substantial improvement over state-of-the-art methods. RSFDM-Net is also practical: it can enhance 720P underwater images in real time. Our main contributions in this paper are summarized as follows:

• We propose a Real-time Spatial and Frequency Domains Modulation Network (RSFDM-Net) for efficient color and detail enhancement in underwater images. In it, the conditional vector is transformed into learnable parameters via global style modulation, and these parameters are broadcast to different layers of the primary net to guide the progressive enhancement of images.

• We present a conditional net with AFGM and MCAM. This strongly guided net generates conditional vectors carrying low-frequency background information and high-frequency detail features, which helps our network model global background information and local texture details more effectively.

• We design the TFE block to correct the color cast of underwater images by integrating the color information extended by the same channel (R, G, or B).

2 Methodology

Underwater images commonly experience mixed degradation in the form of color shift, low contrast (particularly at low frequencies), and lack of texture details (particularly at high frequencies). Our method is aimed at quickly and efficiently enhancing images with minimal computational cost. Drawing inspiration from style transfer techniques, we present a Real-time Spatial and Frequency Domains Modulation Network (RSFDM-Net) for effective enhancement of colors and details in underwater images.

2.1 Real-time Defined Underwater Image Enhancement Network

The whole pipeline of RSFDM-Net is illustrated in Fig.1. The proposed network contains a primary net and a conditional net. The primary net takes degraded underwater images as input and generates enhanced images, with TFE blocks helping to correct the color cast. To guide the primary net further, a conditional net cooperates with it. Through AFGM and MCAM, the conditional net generates conditional vectors that carry low-frequency background information and high-frequency detail features, which helps to model the global background information and local texture details more efficiently. GSM then converts the conditional vectors into learnable parameters, which are broadcast to different layers of the primary net to guide underwater image enhancement. GSM adopts scaling and shifting operations to modulate the intermediate style of the underwater image. The operation of GSM is described as follows:

$$\mathrm{GSM}(x_{i}) = \gamma * x_{i} + \beta, \quad 0 < i \leq N$$

where $\gamma$ , $\beta$ are affine parameters.
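The scale-and-shift modulation can be sketched in a few lines. The example below assumes that the conditional vector is obtained by global average pooling and that $\gamma$ and $\beta$ are produced from it by small fully connected layers. The layer shapes, weights, and function names are hypothetical and serve only to make the data flow concrete.

```python
def global_average_pool(features):
    """Reduce a C x H x W feature map (nested lists) to a length-C vector of channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]

def linear(vector, weights, bias):
    """Fully connected layer: out[k] = sum_j weights[k][j] * vector[j] + bias[k]."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def gsm(features, gamma, beta):
    """Apply GSM(x_i) = gamma * x_i + beta with one (gamma, beta) pair per channel."""
    return [[[gamma[c] * v + beta[c] for v in row] for row in ch]
            for c, ch in enumerate(features)]

# Output of a (hypothetical) conditional net: 2 channels of size 2 x 2.
cond_features = [[[1.0, 3.0], [5.0, 7.0]],
                 [[0.0, 2.0], [2.0, 0.0]]]
cond_vector = global_average_pool(cond_features)  # [4.0, 1.0]

# Hypothetical fully connected weights that map the vector to gamma and beta.
gamma = linear(cond_vector, [[0.25, 0.0], [0.0, 1.0]], [0.0, 1.0])
beta = linear(cond_vector, [[0.0, 0.5], [0.1, 0.0]], [0.0, 0.0])

# Intermediate feature map of the primary net: 2 channels of size 1 x 2.
primary_features = [[[1.0, 2.0]], [[3.0, 4.0]]]
modulated = gsm(primary_features, gamma, beta)
```

Because $\gamma$ and $\beta$ depend only on the pooled vector, the same pair is broadcast over every spatial position of a channel, so the modulation adds very little computation per pixel.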

2.2 Adaptive Fourier Gating Mechanism

Previous methods [13, 14, 21] restore the degraded image in the spatial domain alone. According to [23], processing information in Fourier space captures a global frequency representation, whereas ordinary convolution focuses on learning local representations in the spatial domain. Motivated by this, we propose AFGM, illustrated in Fig.2, to learn more representative features. Specifically, the module separates the magnitude component, which carries global information, through a series of frequency-domain operations that include the Fourier transform.
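The magnitude and phase split that AFGM relies on can be illustrated with a naive 2D discrete Fourier transform. This is a minimal sketch rather than the module itself: the text does not specify the gating function, so a fixed multiplicative gate on the magnitude is used here as a stand-in for the learned gate, and all function names are hypothetical.

```python
import cmath

def dft2(image):
    """Naive 2D discrete Fourier transform of an H x W list of real values."""
    h, w = len(image), len(image[0])
    return [[sum(image[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                 for y in range(h) for x in range(w))
             for v in range(w)]
            for u in range(h)]

def idft2(spectrum):
    """Inverse 2D DFT that returns the real part of the reconstruction."""
    h, w = len(spectrum), len(spectrum[0])
    return [[(sum(spectrum[u][v] * cmath.exp(2j * cmath.pi * (u * y / h + v * x / w))
                  for u in range(h) for v in range(w)) / (h * w)).real
             for x in range(w)]
            for y in range(h)]

def split_magnitude_phase(spectrum):
    """Separate each complex coefficient into its magnitude and phase."""
    magnitude = [[abs(z) for z in row] for row in spectrum]
    phase = [[cmath.phase(z) for z in row] for row in spectrum]
    return magnitude, phase

def gate_and_merge(magnitude, phase, gate):
    """Scale the magnitude by a gate (stand-in for a learned gate) and rebuild the spectrum."""
    return [[cmath.rect(g * m, p) for m, p, g in zip(mrow, prow, grow)]
            for mrow, prow, grow in zip(magnitude, phase, gate)]

image = [[0.2, 0.4, 0.4, 0.2],
         [0.3, 0.5, 0.5, 0.3],
         [0.3, 0.5, 0.5, 0.3],
         [0.2, 0.4, 0.4, 0.2]]
magnitude, phase = split_magnitude_phase(dft2(image))

# A uniform gate of 1.5 brightens the whole patch while the phase (structure) is untouched.
gate = [[1.5] * 4 for _ in range(4)]
enhanced = idft2(gate_and_merge(magnitude, phase, gate))
```

The zero-frequency magnitude equals the sum of all pixels, which shows why a single coefficient of the magnitude already summarizes global brightness, in contrast to the local support of a spatial convolution.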

2.3 Multi-scale Convolutional Attention Module

We want the conditional vector to store both global information and texture details, so that it carries abundant scene knowledge. As shown in Fig.3, MCAM is adopted in the conditional net to learn attention maps for degraded local details. Specifically, let $X_{ci}\in\mathbb{R}^{C\times H\times W}$ denote the feature obtained after the split operation. The key process of MCAM is formulated as:

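$$\begin{array}{l} X_{g1i} = f_{PW}(f_{DWD}(f_{DW}(X_{ci}))), \quad i = 1, 2 \\ X_{g2i} = f_{DW}(X_{ci}), \quad i = 1, 2 \end{array}$$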

where $f_{DW}(\cdot)$ is a $(2d-1)\times(2d-1)$ depth-wise convolution, $f_{DWD}(\cdot)$ is a $\lceil\frac{K}{d}\rceil\times\lceil\frac{K}{d}\rceil$ depth-wise convolution with dilation $d$, and $f_{PW}(\cdot)$ is a point-wise convolution.
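The kernel sizes in this decomposition can be worked through with simple arithmetic. The sketch below assumes the dilated kernel size is rounded up to an integer, ignores bias terms, and uses $K = 21$, $d = 3$, and 16 channels purely as hypothetical values, since the text does not state the ones used in MCAM.

```python
import math

def decomposed_kernel_sizes(K, d):
    """Kernel sizes of the depth-wise and the dilated depth-wise convolutions."""
    dw = 2 * d - 1           # f_DW: (2d-1) x (2d-1)
    dwd = math.ceil(K / d)   # f_DWD: ceil(K/d) x ceil(K/d), dilation d (rounding is an assumption)
    return dw, dwd

def receptive_field(K, d):
    """Side length of the receptive field of f_DWD(f_DW(x))."""
    dw, dwd = decomposed_kernel_sizes(K, d)
    dilated_extent = d * (dwd - 1) + 1  # spatial extent of the dilated kernel
    return dw + dilated_extent - 1

def parameter_counts(K, d, channels):
    """Weights of the decomposed branch versus a dense K x K convolution (no biases)."""
    dw, dwd = decomposed_kernel_sizes(K, d)
    decomposed = channels * (dw * dw + dwd * dwd) + channels * channels  # DW + DWD + point-wise
    dense = channels * channels * K * K
    return decomposed, dense

sizes = decomposed_kernel_sizes(21, 3)
field = receptive_field(21, 3)
decomposed, dense = parameter_counts(21, 3, 16)
```

For these values the branch uses a 5 × 5 and a dilated 7 × 7 depth-wise kernel, covers a 23 × 23 region, and needs 1,440 weights where a dense 21 × 21 convolution over 16 channels would need 112,896.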

2.4 Three-branch Feature Extraction

Previous methods [19, 20] ignore the correlation between channels: all channels in the network are derived from the three channels of R, G, and B. To address this limitation, we design the TFE block to better capture the distribution of color features on the R, G, and B channels. As shown in Fig.4, the block has three small branches, and the weights are not shared between branches. The effectiveness of this block is demonstrated in the ablation study.

2.5 Loss Function

We introduce the Charbonnier Loss [24] as our basic reconstruction loss:

$$\mathcal{L}_{rec} = \mathcal{L}_{c}(\mathcal{I}(X), \mathcal{J}_{gt})$$

where $\mathcal{I}$ is our RSFDM-Net, and $X$ and $\mathcal{J}_{gt}$ stand for the input and the ground truth. $\mathcal{L}_{c}$ denotes the Charbonnier loss, which can be expressed as:

$$\mathcal{L}_{c} = \frac{1}{N}\sum_{i=1}^{N}\sqrt{\|X^{i} - Y^{i}\|^{2} + \epsilon^{2}}$$

where the constant $\epsilon$ is empirically set to 1e-3 for all experiments. In addition, the perceptual quality of the restored image is also critical. We apply a perceptual loss to improve the restoration performance. The perceptual loss can be formulated as follows:

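$$\mathcal{L}_{perceptual} = \sum_{j=1}^{2}\frac{1}{C_{j}H_{j}W_{j}}\|\phi_{j}(\mathcal{I}(X)) - \phi_{j}(Y)\|_{1}$$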

where $\phi_{j}$ represents the specified layer of VGG19 [25]. $C_{j}$, $H_{j}$, $W_{j}$ represent the channel number, height, and width of the feature map, respectively. The overall loss function can be expressed as:

$$\mathcal{L} = \lambda_{1}\mathcal{L}_{c} + \lambda_{2}\mathcal{L}_{perceptual}$$

where $\lambda_{1}$ and $\lambda_{2}$ are set to 1 and 0.2, respectively.
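The loss terms can be sketched on flat lists of pixel values. Two simplifications are assumptions of this example: each $X^{i}$ is treated as a single scalar value, and the perceptual term is passed in as a precomputed number because evaluating VGG19 features is outside the scope of a short sketch.

```python
import math

def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt((x - y)^2 + eps^2) over paired values."""
    return sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target)) / len(pred)

def total_loss(l_c, l_perceptual, lambda1=1.0, lambda2=0.2):
    """Weighted sum L = lambda1 * L_c + lambda2 * L_perceptual."""
    return lambda1 * l_c + lambda2 * l_perceptual

# A perfect prediction still yields eps, which keeps the gradient well defined at zero error.
perfect = charbonnier_loss([0.5, 0.2], [0.5, 0.2])

# One pixel is off by 1.0 and one is exact.
imperfect = charbonnier_loss([1.0, 0.0], [0.0, 0.0])

# Hypothetical perceptual value of 1.0, combined with the weights stated in the text.
combined = total_loss(0.5, 1.0)
```

For large errors the Charbonnier term behaves like an absolute difference, while near zero it is smooth, which is the usual motivation for preferring it over a plain L1 loss.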

3 Experiments

3.1 Underwater image enhancement benchmark datasets

UIEB [26] contains 890 high-resolution raw underwater images with corresponding high-quality reference images, and 60 challenge images (C60) for which no corresponding reference images were obtained. The images cover a wide range of content, such as marine life and divers, and their quality is significantly degraded.

U45 [27] includes three degradation types (biased color, low contrast, and haze-like appearance) and has no corresponding reference images.

3.2 Experimental implementation details

Training Details. We implemented our network using PyTorch with a single RTX 3090 GPU. We trained our model for 400 epochs with a patch size of $512\times 512$ . The Adam optimizer was used with an initial learning rate of $1\times 10^{-4}$ . We employed 800 pairs of raw and sharp images selected from the UIEB [26] for training, and the remaining 90 images, called T90, for testing. We used CyclicLR to adjust the learning rate, with an initial momentum of 0.9 and 0.999. Data augmentation included horizontal flipping, random cropping, and random rotation of the image by $0$, $90$, $180$, or $270$ degrees.
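The augmentation steps listed above can be sketched on nested lists. The key point, an assumption this example makes explicit, is that the raw image and its reference must receive the same crop, flip, and rotation. The function names and the 6 × 6 toy images are hypothetical, and a real pipeline would operate on tensors instead.

```python
import random

def hflip(img):
    """Mirror an H x W image left to right."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate an H x W image by 90 degrees counterclockwise."""
    return [list(col) for col in zip(*img)][::-1]

def crop(img, top, left, size):
    """Cut a size x size patch whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]

def augment_pair(raw, ref, patch, rng):
    """Apply one random crop, flip, and rotation identically to both images."""
    h, w = len(raw), len(raw[0])
    top = rng.randint(0, h - patch)
    left = rng.randint(0, w - patch)
    flip = rng.random() < 0.5
    quarter_turns = rng.choice([0, 1, 2, 3])  # 0, 90, 180, or 270 degrees
    out = []
    for img in (raw, ref):
        img = crop(img, top, left, patch)
        if flip:
            img = hflip(img)
        for _ in range(quarter_turns):
            img = rot90(img)
        out.append(img)
    return out[0], out[1]

rng = random.Random(0)
raw = [[10 * y + x for x in range(6)] for y in range(6)]
ref = [[v + 0.5 for v in row] for row in raw]  # toy reference aligned with raw
raw_patch, ref_patch = augment_pair(raw, ref, 4, rng)
```

Sampling the transform parameters once and reusing them for both images keeps every pixel of the raw patch aligned with its reference, which the pixel-wise reconstruction loss requires.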

Testing Detail. We use the T90 dataset to evaluate the performance of our model in terms of generalization. Additionally, we employ the C60 dataset, which consists of 60 challenge images with no corresponding reference images, as a part of our testing dataset. For further benchmarking, we utilize the U45 [27] dataset which contains three degradation types for testing our model.

Evaluation Metrics. For quantitative measurement, we use the peak signal-to-noise ratio (PSNR) [30], structural similarity index metric (SSIM) [31], underwater color image quality evaluation (UCIQE) [32], underwater image quality measure (UIQM) [33] and the Natural Image Quality Evaluator (NIQE) [34] as objective reference standards for image quality. The PSNR is a full-reference image quality evaluation metric, and it is based on the error between corresponding pixels. SSIM measures the visual impact of three features of an image: brightness, contrast, and structure.
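As a reminder of how the full-reference PSNR metric is computed from pixel errors, here is a minimal sketch. It assumes intensities normalized to $[0, 1]$ and flat lists of pixel values. The paper does not state its intensity range, so the absolute numbers in the tables may follow a different convention.

```python
import math

def mse(pred, target):
    """Mean squared error between two equally sized lists of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    err = mse(pred, target)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)

reference = [0.0, 0.0, 0.0, 0.0]
enhanced = [0.1, 0.1, 0.1, 0.1]
score = psnr(enhanced, reference)  # a uniform error of 0.1 gives about 20 dB
```

Because PSNR is logarithmic in the error, reducing the MSE by a factor of ten adds 10 dB, so small decibel gains at high PSNR correspond to small absolute error reductions.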

Comparison with SOTA Methods. We compare RSFDM-Net with state-of-the-art methods, including non-deep learning and deep learning methods. Non-deep learning methods include UDCP [9], IBLA [10], SMBL [11] and MLLE [35], and deep learning methods include UWCNN [13], Water-Net [36], PRW-Net [28], Shallow-UWnet [29], Ucolor [37], UIEC^2-Net [14] and the latest PUIE-Net [21] method [38]. As shown in Table 2, the proposed RSFDM-Net achieves the best PSNR and SSIM, which indicates that our method is good at handling detail textures and restoring contrast. Compared with the second-best approach, UIEC^2-Net [14], our method is higher by 0.367 dB in PSNR and 0.005 in SSIM. Our method also outperforms most methods in no-reference image quality assessment. We compare parameter counts and efficiency indicators with previous SOTA methods in Table 1. We also present a visual comparison with previous SOTA methods in Fig.5. Our method enhances the whole degraded image thoroughly, while the previous methods still leave some local plaques.
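The margins over UIEC^2-Net and the real-time claim can be checked directly against the values in Tables 1 and 2. The short calculation below copies its inputs from those tables; the variable names are illustrative only.

```python
# Values copied from Table 2 (T90) and Table 1.
ours = {"psnr": 23.325, "ssim": 0.912, "runtime_s": 0.039, "gflops": 31.33}
uiec2 = {"psnr": 22.958, "ssim": 0.907, "runtime_s": 0.174, "gflops": 367.53}

psnr_gain = round(ours["psnr"] - uiec2["psnr"], 3)   # gain in dB
ssim_gain = round(ours["ssim"] - uiec2["ssim"], 3)   # gain in SSIM
frames_per_second = 1.0 / ours["runtime_s"]          # throughput on the test image size
flops_ratio = uiec2["gflops"] / ours["gflops"]       # how much less computation is needed

print(f"PSNR gain: {psnr_gain} dB, SSIM gain: {ssim_gain}")
print(f"Throughput: {frames_per_second:.1f} FPS, FLOPs reduction: {flops_ratio:.1f}x")
```

A runtime of 0.039s per image corresponds to roughly 25.6 frames per second, and the network needs about 11.7 times fewer GFLOPs than UIEC^2-Net according to Table 1.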

3.3 Ablation Study

For the ablation studies, we follow the basic settings presented above and conduct experiments to demonstrate the effectiveness of each component of our method. The results are listed in Table 3. In the S1 setting, the model consists of the original CSR-Net [22]. On top of S1, we apply the other components to form different settings: ii (apply TFE) and iii (apply Instance Normalization). The inclusion of TFE blocks improves the enhancement performance of the network more than IN does. The effectiveness of AFGM and MCAM is also verified, as listed in the bottom rows of Table 3. Based on setting ii, we additionally apply AFGM as setting iv. Similarly, we add MCAM as setting v and apply CA as setting vi. AFGM yields the largest PSNR improvement over setting ii (23.101 dB versus 22.729 dB), while MCAM is more beneficial for restoring image details than CA (SSIM 0.903 versus 0.886).

4 Conclusions

We propose a Real-time Spatial and Frequency Domains Modulation Network (RSFDM-Net) for efficient color and detail enhancement in underwater images. Moreover, we present a conditional net with an Adaptive Fourier Gating Mechanism (AFGM) and a Multiscale Convolutional Attention Module (MCAM) that help the network model global background information and local texture details more effectively. To correct the color cast and low saturation of the image in finer detail, we design a Three-branch Feature Extraction (TFE) block that integrates the color information extended by the same channel. Extensive experiments on underwater datasets demonstrate the superiority of our RSFDM-Net.

Bibliography

[1] Yu Wang, Chong Tang, Mingxue Cai, Jiye Yin, Shuo Wang, Long Cheng, Rui Wang, and Min Tan, “Real-time underwater onboard vision sensing system for robotic gripping,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–11, 2020.
[2] Tian Ye, Yunchen Zhang, Mingchao Jiang, Liang Chen, Yun Liu, Sixiang Chen, and Erkang Chen, “Perceiving and modeling density for image dehazing,” in Computer Vision – ECCV 2022, Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, Eds., Cham, 2022, pp. 130–145, Springer Nature Switzerland.
[3] Yun Liu, Zhongsheng Yan, Aimin Wu, Tian Ye, and Yuche Li, “Nighttime image dehazing based on variational decomposition model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2022, pp. 640–649.
[4] Yun Liu, Zhongsheng Yan, Tian Ye, Aimin Wu, and Yuche Li, “Single nighttime image dehazing based on unified variational decomposition model and multi-scale contrast enhancement,” Engineering Applications of Artificial Intelligence, vol. 116, pp. 105373, 2022.
[5] Tian Ye, Sixiang Chen, Yun Liu, Yi Ye, Jinbin Bai, and Erkang Chen, “Towards real-time high-definition image snow removal: Efficient pyramid network with asymmetrical encoder-decoder architecture,” in Proceedings of the Asian Conference on Computer Vision (ACCV), December 2022, pp. 366–381.
[6] Sixiang Chen, Tian Ye, Yun Liu, Taodong Liao, Yi Ye, and Erkang Chen, “Msp-former: Multi-scale projection transformer for single image desnowing,” arXiv preprint arXiv:2207.05621, 2022.
[7] BL McGlamery, “A computer model for underwater camera systems,” in Ocean Optics VI. SPIE, 1980, vol. 208, pp. 221–231.
[8] Jules S Jaffe, “Computer modeling and the design of optimal underwater imaging systems,” IEEE Journal of Oceanic Engineering, vol. 15, no. 2, pp. 101–111, 1990.