MLCANet: Multi-Level Composite Attention-Guided Network for Non-Homogeneous Image Dehazing in Adverse Weather Conditions

Yongsheng Qiu

PMC · DOI:10.3390/s26051505·February 27, 2026


TL;DR

This paper introduces MLCANet, a new deep learning method for improving image clarity in foggy or adverse weather conditions.

Contribution

The novel MLCANet uses multi-level attention mechanisms to better handle non-homogeneous fog distributions in image dehazing.

Findings

01

MLCANet integrates channel, spatial, and multi-scale pixel attention to capture haze features effectively.

02

The proposed method outperforms existing approaches in restoring fine image details from foggy scenes.

03

Experiments and ablation studies confirm the effectiveness of MLCANet in non-homogeneous image dehazing.

Abstract

Image dehazing is a challenging ill-posed problem in low-level computer vision tasks, requiring the restoration of high-quality, haze-free images from complex and foggy conditions. Deep learning-based dehazing methods struggle to effectively remove non-homogeneous fog distributions due to the uneven and dense nature of fog patches, making it difficult to clear real-world fog variations. A key challenge for non-homogeneous image dehazing algorithms is efficiently capturing the spatial distribution of haze in areas with varying fog densities while restoring fine image details. To address these challenges, we propose MLCANet, a multi-level composite attention-guided network for non-homogeneous image dehazing. MLCANet mitigates the impact of uneven haze areas through two main components: the Multi-level Composite Attention Generation Network (MCAGN) and the Dehazed Image Reconstruction…


Funding

—Kashi University High-Level Talent Recruitment Research Start-up Project
—Xinjiang Key Laboratory of Multimodal Intelligent Computing
—Large Models, and the Program for Innovative Research Team at Kashi University

Keywords

image dehazing; multi-level composite attention; non-homogeneous fog distribution; multi-scale deformable convolution


Taxonomy

Topics: Image Enhancement Techniques · Visual Attention and Saliency Detection · Random lasers and scattering media

Full text

1. Introduction

Image dehazing is a classic ill-posed problem in low-level computer vision tasks, aiming to eliminate haze interference that degrades image visibility in real-world scenes, thereby enhancing image quality. Haze results from the absorption and scattering of light by suspended particles in the atmosphere, negatively affecting both the imaging quality of visible light sensors and the performance of vision-based applications, such as autonomous driving [1], environmental monitoring [2], and remote sensing [3]. Related degradation phenomena such as low-light conditions also pose challenges for image quality restoration, and have been actively studied in parallel research communities [4].

Traditional image dehazing methods typically rely on the Atmospheric Scattering Model (ASM) [5], which assumes a uniform haze distribution across the image. This model uses physical priors to establish a mapping between the hazy image and the dehazed image, enabling haze removal. The ASM can be expressed by Equation (1),

$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr) \tag{1}$$

where x denotes the pixel location, I and J denote the hazy and dehazed images, and t and A represent the atmospheric transmission map and atmospheric light, respectively. Traditional methods focus on estimating the transmission map and atmospheric light to remove haze [6]. However, real-world environments often exhibit varying haze concentrations and distances between the camera and objects, leading to non-homogeneous haze distributions across the image. As a result, ASM-based methods fail to accurately estimate the atmospheric light and transmission map, leading to incomplete dehazing and suboptimal performance in non-homogeneous haze conditions. Therefore, advanced dehazing techniques capable of handling non-homogeneous haze distributions are essential for future image processing applications.
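To make the model concrete, the following minimal sketch applies the standard ASM forward relation, I(x) = J(x)t(x) + A(1 - t(x)), to a single pixel intensity and then inverts it. The function names and the lower bound `t_min` are illustrative assumptions and do not come from the paper.

```python
def apply_haze(j, t, a):
    """Forward ASM for one pixel: hazy intensity from clear intensity j,
    transmission t in (0, 1], and atmospheric light a."""
    return j * t + a * (1.0 - t)


def remove_haze(i, t, a, t_min=0.1):
    """Invert the ASM for one pixel.

    t_min is an assumed safeguard: it keeps the division stable when the
    estimated transmission is close to zero (very dense haze).
    """
    t = max(t, t_min)
    return (i - a) / t + a
```

When t and A are known exactly, the inversion is exact. Because the recovered value scales the residual (I - A) by 1/t, a small error in the transmission estimate is amplified in dense-haze regions, which illustrates why a single global estimate tends to struggle when haze density varies across the image.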

In recent years, deep learning-based dehazing methods have shown promising results due to their strong fitting capabilities [7]. However, these methods rely heavily on extensive image data, and acquiring images with non-homogeneous haze in real environments remains a significant challenge. Despite substantial progress in the field, existing deep learning methods still struggle to handle dense fog clusters in real-world, non-homogeneous fog distributions [8,9]. This shortcoming is primarily due to these methods’ inability to adapt effectively to the non-homogeneity of fog, resulting in poor restoration of image details and textures under varying environmental conditions. To address this issue, synthetic data has been proposed for training; however, accurately simulating the complex interaction between non-homogeneous haze concentration and underlying image semantics remains a formidable challenge. The primary challenge is to accurately define and differentiate haze distribution patterns, efficiently reconstruct image contours and textures in dense fog regions, and restore fine details in areas with light fog. To overcome these challenges, this paper proposes a novel Multi-Level Composite Attention-Guided Network (MLCANet) for non-homogeneous image dehazing. Unlike existing attention-based dehazing methods that stack attention modules in a heuristic or purely parallel manner, MLCANet introduces a task-driven serial attention hierarchy that progressively refines haze representations from channel-level semantics to spatial structures and finally to fine-grained pixel-level details. This serial order is not arbitrary but is explicitly designed to align with the physical characteristics of non-homogeneous haze degradation.

Unlike existing attention-based dehazing methods such as SCANet [10], which employs a self-paced semi-curricular learning strategy to progressively adapt to difficult haze distributions, MLCANet focuses on architectural innovation through a hierarchical serial attention mechanism. While SCANet achieves its performance through sophisticated training dynamics, MLCANet explicitly models non-uniform haze density via a progressive feature refinement process: channel attention first identifies haze-relevant features, spatial attention then localizes haze distributions, and multi-scale pixel attention finally enhances pixel-level details in dense fog regions. This fundamental difference in design philosophy—architectural hierarchy versus training strategy—distinguishes MLCANet from existing methods and enables effective non-homogeneous dehazing without complex training schedules.

Specifically, MLCANet consists of two core components: the Multi-Level Composite Attention Generation Network (MCAGN) and the Dehazed Image Reconstruction Network (DIRN). The main contributions of this work are fourfold:

(1) Task-driven serial attention composition: The sequential integration of Channel Attention (CA), Spatial Attention (SA), and Multi-Scale Pixel Attention (MSPA) is motivated by a hierarchical decomposition of the dehazing problem—first identifying haze-relevant features, then localizing haze distributions, and finally enhancing pixel-level details under varying haze densities. This distinguishes our design from existing attention-based dehazing networks that often adopt parallel or empirically stacked attention modules without clear functional differentiation.

(2) Self-supervised attention guidance via Y-channel luminance difference: We introduce a novel self-supervised attention map generated from the Y-channel difference of the YCbCr color space, which serves as a physically meaningful proxy for haze density. This guidance mechanism is unique to our method and is not present in prior attention-based dehazing works.

(3) Tight integration with deformable and dilated convolutions: The attention-refined features are not simply passed to a generic decoder but are coupled with a reconstruction network specifically designed for non-homogeneous haze, combining multi-scale dilated convolutions and deformable convolutions to jointly model global context and local deformations.

(4) Extensive qualitative and quantitative analyses, including ablation studies, demonstrate the feasibility and effectiveness of MLCANet in non-homogeneous image dehazing.

By explicitly addressing the spatial variability of haze through this composite attention-reconstruction synergy, MLCANet achieves superior dehazing performance, particularly in restoring fine details and textures in dense haze regions, while providing clearer and more reliable visual information for downstream tasks.

2. Related Work

In the area of single-image dehazing, current approaches are typically classified into two primary types. The first category consists of prior-based methods that rely on empirical priors derived by observing statistical differences between hazy and clear images. These methods utilize well-known image properties to guide the dehazing process. The second category includes data-driven approaches that involve learning mapping functions directly or indirectly from large datasets to perform image dehazing. Each approach has its focus, with the latter depending on machine learning techniques and training data to recognize and execute dehazing tasks.

Prior-based Image Dehazing Methods: Prior-based methods rely on empirical priors derived from observed statistical differences between hazy and clear images. Early examples, such as the Dark Channel Prior (DCP) [11], Color Attenuation Prior [12], Non-local Prior [13], Gamma Correction Prior (GCP) [14], and the linear transformation-based method proposed by Wang et al. [15], use physical priors to estimate atmospheric light and transmission maps based on the atmospheric scattering model. These methods are effective for images with uniform haze distribution. However, in real-world scenarios with non-homogeneous haze distribution, predefined priors break down, often resulting in incomplete dehazing and artifacts [10].

Data-driven Image Dehazing Methods: With the advancement of deep learning, data-driven techniques have attracted considerable interest. Early deep learning-based methods for dehazing often employed convolutional neural networks (CNNs) to estimate atmospheric light and transmission maps, leveraging the atmospheric scattering model. Notable examples include DehazeNet [16], MSCNN [17], AOD-Net [18], and DCPDN [19]. However, these methods suffer from inaccuracies in parameter estimation, which leads to reduced performance in non-homogeneous foggy conditions. To reduce dependence on the atmospheric scattering model, newer methods utilize synthetic datasets to directly learn the mapping between hazy and clear images in an end-to-end manner. These methods include GridDehazeNet [20], FFA-Net [21], AECR-Net [22], PMDNet [23], and Dehamer [24]. While deep learning-based methods have shown significant progress, their reliance on large training datasets limits their performance in non-homogeneous haze conditions. Moreover, the increasing computational complexity of these networks negatively impacts processing efficiency. Despite advancements with synthetic datasets, these methods still struggle to effectively handle dehazing in non-homogeneous, real-world environments. Recently, specific solutions have been proposed for non-homogeneous dehazing tasks [25], but challenges remain in addressing uneven haze distribution and restoring fine image details [26].

In addition to the aforementioned deep learning-based dehazing methods, several very recent studies have further advanced the field. For instance, the work in [27] systematically evaluates various image quality metrics as loss functions for image dehazing, providing valuable guidance for loss function design. Another recent study [28] proposes a regression-based depth and scattering estimation network for efficient dehazing. These emerging directions highlight the continued activity of this research area and offer potential synergies with our work.

3. Proposed Method

3.1. Overall Network Architecture

The architecture of MLCANet is shown in Figure 1, featuring a dual-branch configuration. It consists of two main components: the Multi-Level Composite Attention Generation Network (MCAGN) and the Dehazed Image Reconstruction Network (DIRN). The MCAGN integrates Channel Attention (CA), Spatial Attention (SA), and Multi-Scale Pixel Attention (MSPA), enabling MLCANet to capture and process haze of varying densities, which improves the precision of dehazing. The DIRN combines multi-scale dilated convolution with deformable convolution in its encoder-decoder structure, allowing for the flexible restoration of image details and textures, particularly in regions with high haze concentrations.

3.2. Multi-Level Composite Attention Generation Network (MCAGN)

Traditional dehazing methods often fail to account for the non-homogeneous distribution of haze in real-world images, particularly at the channel, spatial, and pixel levels. This limitation leads to suboptimal dehazing results, especially in regions with varying haze densities. To address this issue, we propose a Multi-Level Composite Attention Generation Network (MCAGN). As shown in Figure 2, MCAGN integrates three types of attention modules in a serial, task-driven manner: Channel Attention (CA), Spatial Attention (SA), and Multi-Scale Pixel Attention (MSPA). The serial order of CA → SA → MSPA is designed to follow a hierarchical refinement logic: channel attention first identifies which features are haze-relevant, spatial attention then determines where the haze is located, and multi-scale pixel attention finally refines how to enhance each pixel under varying haze densities. This decomposition aligns with the physical process of haze degradation and provides clear functional differentiation among attention modules, distinguishing our design from parallel or heuristic stacking approaches. This hierarchical design is particularly suited for non-homogeneous haze because different regions of the image require different levels of feature refinement—channel-level selection suffices for thin haze, while dense haze regions additionally demand spatial localization and pixel-level enhancement. By progressively increasing the representational granularity, the network adapts its focus to the local haze density without requiring explicit density estimation.

Motivation for serial attention design. The serial order of CA → SA → MSPA is not heuristic but is deliberately designed to follow a hierarchical refinement strategy that mirrors the physical degradation process of non-homogeneous haze. Specifically, channel attention first identifies which feature channels are most relevant to haze distortion, enabling the network to emphasize semantically meaningful representations. Spatial attention then determines where the haze is located across the image plane, allowing the model to focus on regions with varying fog densities. Finally, multi-scale pixel attention refines how to enhance each pixel under different haze intensities by capturing fine-grained details at multiple receptive fields. This progressive decomposition, from channel-level semantics to spatial structure and finally to pixel-level detail, provides a clear functional differentiation among attention modules, which is fundamentally different from the parallel or empirically stacked attention designs commonly used in existing dehazing networks. Parallel attention mechanisms, while computationally efficient, often fail to capture the hierarchical dependencies between different levels of haze features, leading to less effective feature refinement. In contrast, our serial design explicitly models these dependencies and minimizes information redundancy across attention stages.

By structuring the attention flow in this way, MCAGN is able to generate haze-density-aware attention maps that accurately reflect the spatial variability of fog, thereby providing more informative guidance for the subsequent image reconstruction network.

The MCAGN processes the input hazy image through a sequence of CA, SA, and MSPA layers. These layers produce attention-guided feature map outputs that emphasize regions with varying haze densities. The CA mechanism first captures local features through two 3 × 3 convolution layers, each producing feature maps representing the image’s response to different kernels. These feature maps represent different channels of the image. To capture global information, Global Average Pooling (GAP) is applied to each feature map, aggregating information across spatial dimensions into a single numerical value per feature channel. This compressed value serves as a global statistic for each channel.

Subsequently, the feature maps are processed through two 1 × 1 convolution layers and a sigmoid activation function to generate channel attention weights based on the global average pooled values. Each feature map is then multiplied by its corresponding weight, resulting in the weighted amplification of channel features. This process enhances the representation of features across varying haze densities, helping the network better recognize and process non-homogeneous haze.
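The channel attention computation described above (global average pooling, two 1 × 1 convolutions, a sigmoid, and channel-wise rescaling) can be sketched in plain Python as follows. On a 1 × 1 × C descriptor a 1 × 1 convolution reduces to a matrix-vector product, so the two layers are written as weight matrices `w1` and `w2`. The ReLU between the two layers, the omission of biases, and all names are assumptions made for illustration; the paper does not specify them.

```python
import math


def global_average_pool(features):
    """features: list of C channels, each an H x W list of lists.
    Returns one mean value per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]


def dense(weights, vec):
    """A 1x1 convolution on a 1x1xC tensor is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_attention(features, w1, w2):
    """Return (rescaled features, per-channel attention weights)."""
    z = global_average_pool(features)                # global statistic per channel
    hidden = [max(0.0, h) for h in dense(w1, z)]     # ReLU is an assumption
    weights = [1.0 / (1.0 + math.exp(-s)) for s in dense(w2, hidden)]
    scaled = [[[weights[c] * v for v in row] for row in ch]
              for c, ch in enumerate(features)]
    return scaled, weights
```

Each attention weight lies in (0, 1), so the module can only attenuate channels relative to one another; channels that respond strongly to haze keep more of their magnitude than the rest.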

The feature maps generated by the CA mechanism are subsequently fed into the Spatial Attention (SA) layer, which integrates Global Average Pooling (GAP) and Global Max Pooling (GMP) to aggregate spatial information across different channels. This produces two 1 × 1 × C feature maps, which are then processed through 1 × 1 convolutions to generate a spatial attention map. The spatial attention map is scaled to the range [0, 1] using a sigmoid function to produce spatial attention weights. These weights are element-wise multiplied with the original feature maps, enhancing important features and suppressing less relevant ones, thus refining the representation of hazy regions.

Finally, to capture feature information at various scales and effectively address areas with different haze densities, the MSPA mechanism is employed. The process begins with two 3 × 3 convolution layers to extract initial features. Dilated convolutions with four different kernel sizes (1 × 1, 3 × 3, 5 × 5, 7 × 7) are applied in parallel to capture multi-scale spatial information. The resulting multi-scale feature maps are processed with two 1 × 1 convolution layers and a sigmoid function to generate a comprehensive multi-scale attention representation. This improves the network’s ability to handle complex, non-homogeneous haze distributions and enhances dehazing performance across varying haze densities.

It has been widely observed that the luminance component (Y channel) in the YCbCr color space is highly correlated with scene depth and haze density under natural illumination conditions [29]. Motivated by this physical prior, we convert the images to the YCbCr color space and use the Y-channel luminance difference as a self-supervised guidance signal, which serves as the ground-truth attention map $[eqn]$. This self-supervised map keeps the dehazing network focused on the non-homogeneous distribution of haze and suppresses the model’s tendency to allocate weights predominantly to low-concentration areas, which would otherwise reduce dehazing quality. As illustrated in Figure 1, when the final attention feature (M) is generated, the attention feature map ( $[eqn]$ ) produced by the MCAGN is fused with $[eqn]$. Let $[eqn]$ represent the fusion coefficient between them; then, the final fused attention feature can be expressed as follows:

[eqn]

where the fusion coefficient is dynamically adjusted based on the L1 loss, which is calculated as follows:

[eqn]

The dynamic adjustment of the fusion coefficient between $[eqn]$ and $[eqn]$ is governed by the equation above. Initially, the attention feature map $[eqn]$ is predominantly composed of $[eqn]$ to effectively suppress model divergence issues caused by significant adversarial losses during the initial stages. As the loss function decreases, the proportion of $[eqn]$ within $[eqn]$ is gradually increased. Finally, when the loss $[eqn]$ , consists solely of $[eqn]$ .

The final multi-level composite attention features (F) are obtained by weighting through an adaptive gating unit ( $[eqn]$ ). The computation process is outlined as follows:

[eqn]

In the computation process, the symbol ⊗ denotes element-wise multiplication of pixel features.

The fusion coefficient $[eqn]$ in Equation (2) and the loss threshold for switching from $[eqn]$ to $[eqn]$ are set to 0.3 and 0.05, respectively. These values were selected based on cross-validation on the validation set and yield the best overall dehazing performance in our experiments.
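The guidance mechanism of this subsection can be illustrated with a short sketch. It rests on several assumptions that the text does not confirm: the luminance difference is taken as the absolute difference between the Y channels of the hazy and haze-free images, the fusion is a convex combination in which the coefficient weights the MCAGN map, and the coefficient follows a simple step rule (0.3 while the L1 loss is above 0.05, then 1). The luma weights are the standard BT.601 coefficients; all function names are hypothetical.

```python
def luma(rgb):
    """BT.601 luma (the Y of YCbCr) for an (r, g, b) triple in [0, 1]."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def luminance_difference_map(hazy, clear):
    """Per-pixel |Y_hazy - Y_clear| for two H x W images of RGB triples.
    Using the hazy/clear pair is an assumption about the paper's definition."""
    return [[abs(luma(h) - luma(c)) for h, c in zip(hazy_row, clear_row)]
            for hazy_row, clear_row in zip(hazy, clear)]


def fusion_coefficient(l1_loss, alpha_init=0.3, threshold=0.05):
    """Assumed step rule: weight given to the learned (MCAGN) map."""
    return 1.0 if l1_loss <= threshold else alpha_init


def fuse(m_att, m_gt, alpha):
    """Convex combination of the learned map m_att and the guidance map m_gt."""
    return [[alpha * a + (1.0 - alpha) * g for a, g in zip(att_row, gt_row)]
            for att_row, gt_row in zip(m_att, m_gt)]
```

Under this reading, early in training the fused map is dominated by the luminance guidance (weight 0.7), and once the loss falls below the threshold it consists solely of the MCAGN output, which matches the behavior described above. A smooth transition between the two regimes would be an equally plausible implementation.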

3.3. Dehazed Image Reconstruction Network (DIRN)

To improve the adaptability of the image dehazing network to varying haze distributions and to accurately restore image details, we propose an efficient image reconstruction network built on an encoder-decoder structure that combines multi-scale dilated convolutions and deformable convolutions. The integration of these two convolutional techniques offers several key advantages:

(1) Expanded Receptive Field: Dilated convolutions increase the receptive field by enlarging the dilation rate of the convolutional kernels. This enables the network to capture more contextual information, which is crucial for image dehazing, as haze often has a non-homogeneous distribution and a widespread effect across the image. Dilated convolutions help the model cover a larger area without significantly increasing computational costs.

(2) Adaptation to Local Deformations: Deformable convolutions add flexibility to the network. They adjust the shape and position of the kernels by learning the offsets of the convolutional kernels, allowing the model to better adapt to local features and deformations. In areas of visual distortion caused by haze, deformable convolutions can more accurately restore the image’s original appearance by adjusting their sampling positions.

(3) Enhanced Adaptability to Complex Environments: The combination of dilated and deformable convolutions allows the model to more effectively handle non-homogeneous haze. Dilated convolutions capture macro-level contextual information, while deformable convolutions enable micro-level adjustments. This synergy enables the network to precisely handle variations in haze density and feature changes across different regions.

(4) Improved Detail Restoration: In image dehazing, especially in urban landscapes or complex backgrounds, restoring image details is critical. The hierarchical combination of dilated and deformable convolutions allows the network to make precise adjustments to fine details while maintaining a large receptive field. This improves the quality of the restored image.

The image reconstruction network, as shown in Figure 1, includes a module called Multi-scale DilatConv & DeformConv, as illustrated in Figure 3. The network starts with two 3 × 3 convolutional layers for downsampling and feature extraction, followed by dilated convolution layers to enlarge the receptive field and capture a wider range of contextual information. These dilated convolutions are set with varying dilation rates (1, 2, and 3) to accommodate different scales of haze features in the image.
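The effect of the dilation rates on the receptive field can be checked with a small calculation. A k × k kernel with dilation rate d covers an effective extent of k + (k - 1)(d - 1) pixels per side, so 3 × 3 kernels with rates 1, 2, and 3 span 3, 5, and 7 pixels without adding parameters. The second helper assumes, purely for illustration, that the layers are stacked sequentially with stride 1; the text does not state whether the dilated branches are serial or parallel.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated convolution kernel (per side)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stride-1 dilated layers applied one after another.
    Sequential stacking is an assumption, not a detail from the paper."""
    rf = 1
    for d in dilations:
        rf += effective_kernel_size(kernel_size, d) - 1
    return rf
```

For example, `[effective_kernel_size(3, d) for d in (1, 2, 3)]` gives `[3, 5, 7]`, and stacking the three layers would yield a receptive field of 13 pixels per side.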

Following the dilated convolutions, deformable convolution layers are introduced. These layers dynamically adjust the convolution kernels by learning offsets, allowing the model to focus more precisely on local variations in haze density and distortion. Finally, the network gradually restores features to their original resolution through two stride-2 transpose convolution layers.
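To illustrate how learned offsets change the sampling positions, the sketch below evaluates a single 3 × 3 deformable convolution response on a single-channel image. It follows the usual formulation, in which each kernel tap samples the input at its regular grid position plus a learned fractional offset using bilinear interpolation. The offsets are passed in directly rather than predicted by a convolution, and out-of-range positions are clamped to the border; both are simplifying assumptions.

```python
import math


def bilinear_sample(img, y, x):
    """Bilinearly interpolate img (H x W list of lists) at fractional (y, x).
    Coordinates are clamped to the image border (an assumed convention)."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def deformable_conv_at(img, weights, offsets, cy, cx):
    """3x3 deformable convolution response centered at (cy, cx).

    weights: 3x3 list of kernel weights.
    offsets: 3x3 list of (dy, dx) shifts, one per kernel tap.
    """
    out = 0.0
    for i in range(3):
        for j in range(3):
            dy, dx = offsets[i][j]
            out += weights[i][j] * bilinear_sample(
                img, cy + (i - 1) + dy, cx + (j - 1) + dx)
    return out
```

With all offsets set to zero the function reduces to an ordinary 3 × 3 convolution; non-zero offsets let each tap drift toward, for instance, an edge that haze has blurred.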

The dehazed image is produced through a restoration module, which includes edge padding, a 7 × 7 convolution layer, and a tanh activation function. This ensures that the output image maintains naturalness and visual appeal.

3.4. Dehazing Loss Function

During training, MLCANet is constrained jointly by a multi-source loss function. Specifically, the total loss of the model, as shown in Equation (5), comprises five loss terms. The L1 smooth loss ( $[eqn]$ ) primarily ensures the fidelity of overall brightness and color between the dehazed image and the real-world image [30]. The perceptual loss ( $[eqn]$ ) [31] enhances the network’s capability to restore image details by measuring similarity in the image feature space. The multi-scale structural similarity loss ( $[eqn]$ ) [32] improves the network’s ability to express image contour information by evaluating the structural similarity between the generated dehazed image and the ground-truth image. The adversarial loss ( $[eqn]$ ) [33] aims to enhance the generalization ability of the dehazing network so that it removes both non-homogeneously and uniformly distributed haze, while encouraging the generation of visually more authentic and natural images. The mean squared error loss ( $[eqn]$ ) [34] works at the pixel level by computing the pixel-wise differences between the dehazed image and the clear image, adjusting the brightness and color of each pixel so that texture details become clearer. This is particularly important for non-homogeneously distributed haze, which often exhibits complex variations at the pixel level.

[eqn]

The hyperparameters $[eqn]$ are set based on prior experience and experimental adjustments. They have been empirically set to 1, 0.5, 0.01, 0.6, and 0.005 respectively, as these values yield the best dehazing effects for the model.
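The weighted combination can be written as a short helper. The sketch assumes that the five weights apply to the loss terms in the order in which they are introduced above (L1 smooth, perceptual, MS-SSIM, adversarial, MSE); the dictionary keys and function name are hypothetical.

```python
# Weights as reported in the text; their assignment to individual terms
# follows the order of presentation and is an assumption.
LOSS_WEIGHTS = {
    "l1": 1.0,
    "perceptual": 0.5,
    "ms_ssim": 0.01,
    "adversarial": 0.6,
    "mse": 0.005,
}


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss values.

    terms: dict mapping each loss name to its scalar value for one batch.
    """
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)
```

With these weights, the L1 and adversarial terms dominate the objective, while the MS-SSIM and MSE terms act as light regularizers unless their raw magnitudes are much larger.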

L1 Smooth Loss: The L1 smooth loss is primarily used to measure the absolute error between the dehazed image $[eqn]$ and the real environment image $[eqn]$ (ground truth). This loss helps maintain the fidelity of the image, ensuring that the color and brightness of the dehazed image closely resemble the actual conditions. It can be expressed as:

[eqn]

[eqn]

where $[eqn]$ represents the pixel value of the dehazed image at the pixel location $[eqn]$ , $[eqn]$ represents the pixel value of the real environment image at the same location, and N is the total number of pixels. This metric encourages the preservation of natural appearance in the dehazed image, aligning it more closely with what is observed in the actual environment.
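For reference, a minimal sketch of this loss on flattened pixel lists is given below. It assumes the widely used piecewise definition of the smooth L1 loss (quadratic for errors below a threshold `beta`, linear above it) with `beta = 1`; whether the paper uses exactly this form and threshold is not stated, so both are assumptions.

```python
def l1_loss(pred, target):
    """Mean absolute error over N pixels."""
    return sum(abs(p - g) for p, g in zip(pred, target)) / len(pred)


def smooth_l1_loss(pred, target, beta=1.0):
    """Smooth L1: 0.5 * d^2 / beta if d < beta, else d - 0.5 * beta,
    averaged over N pixels (beta = 1 is an assumed default)."""
    total = 0.0
    for p, g in zip(pred, target):
        d = abs(p - g)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)
```

For pixel values normalized to [0, 1], every error is below `beta = 1`, so the smooth variant behaves quadratically throughout and penalizes small residuals less than the plain L1 loss.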

Perceptual Loss: To enhance the ability to restore image details, particularly in terms of texture and edge sharpness, we introduce the perceptual loss function, which can be represented as follows:

[eqn]

where $[eqn]$ represents the feature map obtained from the l-th layer of a pretrained deep network (in this case, VGG16 trained on ImageNet) for the dehazed image $[eqn]$ , and $[eqn]$ represents the feature map for the real environment image $[eqn]$ obtained from the same layer. N denotes the number of features, and $[eqn]$ denotes the L2 norm, used to quantify the Euclidean distance between the two feature maps. By minimizing this distance in a high-dimensional feature space, the loss helps the network align the restored details of the dehazed image with those of the ground truth, thereby improving the perceptual quality of the dehazed images.

Multi-Scale Structural Similarity Loss (MS-SSIM Loss): The Structural Similarity Index (SSIM) is a metric used to assess the similarity between two images by taking into account luminance, contrast, and structure. Multi-Scale SSIM (MS-SSIM) extends this concept by evaluating image quality across multiple scales, offering a more thorough evaluation of structural integrity. The loss function is defined as follows:

[eqn]

[eqn]

where $[eqn]$ and $[eqn]$ represent the pixel values of the dehazed image and the ground truth image, respectively. The MS-SSIM index calculates similarity across multiple scales by adjusting parameters such as c and $[eqn]$ (small constants to stabilize the division with weak denominator), and uses mean ( $[eqn]$ ), standard deviation ( $[eqn]$ ), and variance ( $[eqn]$ ) in its computation. This approach allows the dehazing model to preserve important structural information across various scales of the image, enhancing the perceptual quality and fidelity of the dehazed outputs.
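The structure of the SSIM index can be illustrated with a simplified, single-scale version computed from whole-image statistics. This is not the MS-SSIM used in the paper: MS-SSIM evaluates windowed statistics at several downsampled resolutions and combines them. The stabilizing constants follow the common convention c1 = (0.01 L)^2 and c2 = (0.03 L)^2 for dynamic range L, and defining the loss as 1 - SSIM is an assumption.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-scale SSIM from global statistics of two flattened images."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilizes the contrast/structure term
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


def ssim_loss(x, y, data_range=1.0):
    """Loss that decreases as structural similarity increases (assumed form)."""
    return 1.0 - global_ssim(x, y, data_range)
```

Identical images give an index of 1 and therefore zero loss, while an image whose structure is inverted relative to the reference yields a negative index.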

Adversarial Loss: To further enhance the generalization capability of the model and enable the dehazing network to effectively handle not only non-homogeneous but also uniform haze, adversarial loss is introduced during model training. The expression for adversarial loss is as follows:

[eqn]

where $[eqn]$ represents the output of the discriminator network for the dehazed image, which assesses the “realness” of the image, i.e., the probability that the discriminator considers the image to be a real one. S denotes the number of training samples. This loss function guides the network to generate dehazed images that closely resemble real, haze-free images, promoting visually realistic and natural-looking results.
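One common form of a generator-side adversarial loss that is consistent with this description is the average negative log-probability assigned by the discriminator to the dehazed images. The sketch below uses that form as an assumption; the exact expression in the paper may differ, and the clipping constant `eps` is introduced only for numerical safety.

```python
import math


def adversarial_loss(disc_probs, eps=1e-12):
    """Mean of -log D(dehazed image) over S samples.

    disc_probs: discriminator outputs in [0, 1], one per dehazed image.
    eps: assumed lower clip that avoids log(0).
    """
    return sum(-math.log(max(p, eps)) for p in disc_probs) / len(disc_probs)
```

The loss is zero when the discriminator is fully convinced that every dehazed image is real and grows without bound as its confidence drops toward zero.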

Mean Squared Error Loss (MSE Loss): The MSE loss quantifies the discrepancy between the dehazed and real images by averaging the squared pixel-wise differences between them. This loss is particularly effective in refining the quality of images at the pixel level, especially in restoring the texture details of the image. The formula for MSE loss is as follows:

[eqn]

where $[eqn]$ represent the number of channels, width, and height of the image, respectively. This loss function helps ensure that each pixel of the dehazed image closely matches the corresponding pixel in the ground truth image, thus promoting high fidelity in the dehazing output.
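As a concrete reading of this definition, the sketch below averages the squared differences over all channels, rows, and columns of two images stored as nested lists. Normalizing by the total number of values (channels × height × width) is assumed from the description of the symbols above.

```python
def mse_loss(pred, target):
    """Mean squared error between two images given as C x H x W nested lists."""
    total, count = 0.0, 0
    for pred_ch, target_ch in zip(pred, target):
        for pred_row, target_row in zip(pred_ch, target_ch):
            for p, t in zip(pred_row, target_row):
                total += (p - t) ** 2
                count += 1
    return total / count
```

Because the error is squared, a few badly restored pixels in a dense-haze patch contribute far more than many slightly inaccurate pixels elsewhere, which may explain the small weight given to this term in the total loss.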

In summary, the combined loss function not only improves specific aspects of image quality but also yields synergistic effects that enhance overall dehazing performance. This is especially effective in handling non-homogeneous haze, where such a comprehensive approach can more efficiently address the inconsistencies of haze across different regions. As a result, the dehazing network model (MLCANet) can better understand and reconstruct images, providing more accurate and visually pleasing dehazing outcomes.

4. Experiments and Analysis

This section describes the experimental setup used to evaluate the MLCANet network. We first present the experimental parameters, dataset distribution, and both subjective and objective evaluation methods. Then, we perform a thorough comparison, both qualitatively and quantitatively, using various objective metrics to assess the performance of traditional and data-driven dehazing algorithms, highlighting the advantages of our non-homogeneous image dehazing approach. Following this, we conduct ablation studies to validate the contribution of each component of the network. Additionally, we evaluate the scalability of the dehazing algorithm by applying it to advanced computer vision tasks, such as object detection in foggy environments. Finally, we provide an in-depth discussion on dehazing efficiency, computational complexity, and robustness. Through this extensive experimental evaluation, we show that the proposed dehazing method not only effectively mitigates the inconsistencies in non-homogeneous haze distribution but also enhances the clarity of image details and textures.

4.1. Implementation Details

The non-homogeneous image dehazing network is implemented in the PyTorch framework and trained with the Adam optimizer ( $[eqn]$ ). The initial learning rate is set to 0.0001 and the batch size to 4. Training spans 200 epochs. Data augmentation during training consists of random flipping and random cropping to 512 × 512 image patches. The learning rate remains constant for the first 100 epochs and then decays linearly to 0.00001 over the next 100 epochs. The hyperparameters in the loss function are selected through extensive cross-validation and empirical tuning. The entire network is trained end to end with joint optimization, eliminating the need for large-scale pre-training.
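The learning-rate schedule described above can be sketched as a small function. This is a minimal illustration rather than the authors' code: the function name, the 0-indexed epoch convention, and the choice to reach the final rate exactly at the last epoch are assumptions.

```python
def learning_rate(epoch, base_lr=1e-4, final_lr=1e-5,
                  constant_epochs=100, decay_epochs=100):
    """Learning rate for a 0-indexed epoch (hypothetical helper).

    Constant at base_lr for the first `constant_epochs` epochs, then a
    linear decay that reaches final_lr at the end of the decay phase.
    """
    if epoch < constant_epochs:
        return base_lr
    # Fraction of the decay phase completed, clamped to at most 1.
    progress = min((epoch - constant_epochs + 1) / decay_epochs, 1.0)
    return base_lr + progress * (final_lr - base_lr)


# Print the rate at a few representative epochs.
for e in (0, 99, 100, 149, 199):
    print(e, learning_rate(e))
```

In PyTorch, the ratio `learning_rate(epoch) / base_lr` could serve as the multiplicative factor passed to `torch.optim.lr_scheduler.LambdaLR`, which scales the optimizer's initial learning rate by the returned value.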

4.2. Datasets and Evaluation Metrics

We employ three non-homogeneous haze datasets and two dense uniform haze datasets to train and evaluate the MLCANet model. These datasets include NTIRE2020 [35], NTIRE2021 [36], NTIRE2023 [37], NTIRE2019 [38], and O-HAZE [29]. NTIRE2019 and O-HAZE are uniform haze datasets. NTIRE2019 contains 55 images (1600 × 1200) with 45 training, 5 validation, and 5 test images. O-HAZE consists of 45 pairs of hazy and haze-free images captured with a haze machine in real outdoor environments. NTIRE2020, NTIRE2021, and NTIRE2023 are non-homogeneous haze datasets. NTIRE2020 and NTIRE2021 images are also 1600 × 1200, with NTIRE2020 containing 45 training, 5 validation, and 5 test images, and NTIRE2021 containing 25 training, 5 validation, and 5 test images. NTIRE2023, with a resolution of 4000 × 6000, contains 40 training, 5 validation, and 5 test images. The dataset distribution is summarized in Table 1.

For objective evaluation, we employ the Structural Similarity Index (SSIM) [39] and Peak Signal-to-Noise Ratio (PSNR) [40] to assess the quality of the dehazed images. Additionally, a user study is conducted to subjectively evaluate the proposed method against recently proposed dehazing algorithms. Combining the objective metrics with this subjective evaluation provides a more complete view of dehazing performance.
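As a reference for how PSNR relates to pixel-wise error, the following minimal sketch computes it from the mean squared error. It assumes images are given as flat sequences of intensities scaled to [0, 1]; the function names and the `max_value` parameter are illustrative and not taken from the paper.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Toy example: every pixel is off by 0.1, so the MSE is 0.01.
print(psnr([0.0, 0.0, 0.0], [0.1, 0.1, 0.1]))
```

Because PSNR is logarithmic in the MSE, halving the MSE raises PSNR by about 3 dB regardless of the starting point.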

Furthermore, we compare the proposed MLCANet method against various prior-based and data-driven image dehazing methods. The prior-based method considered is the Dark Channel Prior (DCP) [11]. For data-driven methods, we evaluate both atmospheric scattering model-based methods, such as AOD-Net [18], and purely data-driven approaches, including GCANet [41], GridDehazeNet [20], FFA-Net [21], C2PNet [25], SGDRL [42], MFINEA [3], DEA-Net [24], MixDehazeNet [43], and the CNN-transformer hybrid DeHamer [23].

4.3. Comparison with Leading Methods

To assess the overall dehazing performance of the MLCANet network, we perform both qualitative and quantitative comparisons with traditional prior-based methods and cutting-edge deep learning approaches on multiple test datasets. Because the compared methods were originally trained on different samples, we handle them in two ways. For methods that provide pretrained models, we directly use the best available pretrained model to compute the objective evaluation metrics. For methods without pretrained models, we retrain them using the publicly released code and adjust parameters to ensure a fair comparison.

To ensure fairness, all compared methods are evaluated on the same test images under identical resolutions. For retrained methods, we follow the training/validation splits and hyperparameter settings recommended by the original authors. Testing is performed on multiple datasets, which include both uniformly and non-homogeneously distributed haze as well as real-world image datasets, providing a comprehensive evaluation of each method’s performance.

4.3.1. Image Dehazing Quantitative Comparisons

Table 2 presents a comparison of the performance of several advanced image dehazing algorithms across five different test datasets. These datasets vary in haze conditions, ranging from light to heavy haze, providing a diverse set of challenges for evaluating algorithm performance. We use Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) to assess the dehazing effectiveness and stability of each algorithm.

The compared methods include recent state-of-the-art approaches specifically designed for non-homogeneous dehazing, such as SCANet [10] and MixDehazeNet [43], as well as strong baselines like C2PNet [25] and DeHamer [23]. This ensures a comprehensive and up-to-date evaluation of our proposed MLCANet.

As shown in the table, MLCANet and SCANet exhibit superior performance across all datasets. Specifically, MLCANet achieves an average PSNR of 20.76 dB and SSIM of 0.6106, while SCANet achieves an average PSNR of 19.66 dB and SSIM of 0.5858. These results demonstrate not only high restoration quality but also a high degree of structural similarity in the dehazed images. In contrast, algorithms like C2PNet and DeHamer perform well on certain datasets but fluctuate significantly on more complex or high-density haze conditions. For example, C2PNet achieves a PSNR of 15.01 dB and SSIM of 0.5021 on the NTIRE2023 dataset but drops to 9.72 dB and 0.3745 on the NTIRE2020 dataset.
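To put the average PSNR gap between MLCANet and SCANet in perspective, it can be converted into an approximate ratio of mean squared errors. This is only a rough reading, since it treats the dataset-averaged PSNR values as if they were single-image values computed with the same peak intensity.

```latex
\Delta_{\mathrm{PSNR}} = 20.76 - 19.66 = 1.10\ \mathrm{dB},
\qquad
\frac{\mathrm{MSE}_{\mathrm{SCANet}}}{\mathrm{MSE}_{\mathrm{MLCANet}}}
  = 10^{\Delta_{\mathrm{PSNR}}/10} = 10^{0.11} \approx 1.29
```

Under these assumptions, a 1.10 dB advantage corresponds to roughly 29% higher pixel-wise error for the lower-scoring method.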

In terms of stability, MLCANet and SCANet show minimal performance variation across different datasets, indicating better generalization and robustness. Conversely, MixDehazeNet exhibits larger performance fluctuations, particularly on the NTIRE2020 and NTIRE2021 datasets, where it shows much lower PSNR and SSIM values compared to other datasets. This variability may indicate the algorithm’s limitations in handling specific types of haze or image conditions.

Prior-based dehazing methods generally perform poorly on non-homogeneous haze images, struggling to eliminate the interference from varying haze concentrations. However, data-driven methods show a strong adaptability to non-homogeneous haze distribution tasks, with noticeable improvements in PSNR and SSIM. From Table 2, it is clear that the proposed MLCANet not only excels in non-homogeneous haze environments but also performs well in uniform haze conditions, showing improvements over prior methods.

Compared to traditional dehazing approaches, MLCANet significantly outperforms prior methods in terms of both PSNR and SSIM. This can be attributed to two main factors. First, the Multi-level Composite Attention Generation Network (MCAGN) captures multi-channel and multi-level spatial features in regions with varying haze densities, allowing the model to better understand and process different haze intensities. Second, the image reconstruction process benefits from the encoder-decoder structure, which combines multi-scale dilated convolution and deformable convolution. This combination enlarges the model’s receptive field and improves its adaptability to complex environments, ultimately improving the restoration of image details.

It is worth noting that while MLCANet achieves substantial improvements on the O-HAZE and NTIRE2019 homogeneous haze datasets, its performance on non-homogeneous datasets (e.g., NTIRE2021 and NTIRE2023) is comparable to, and in some cases slightly behind, SCANet [10]. This can be attributed to the different design philosophies of the two networks. SCANet incorporates a curriculum learning strategy that progressively adapts to difficult haze distributions, which is particularly effective for highly diverse non-homogeneous scenes. In contrast, MLCANet focuses on explicit hierarchical feature refinement via serial attention, which provides stronger regularization and generalizes well to homogeneous haze, but may be less adaptive to extreme variations in non-homogeneous conditions. We believe these two paradigms are complementary, and integrating curriculum learning into our framework is a promising direction for future work.

4.3.2. Image Dehazing Qualitative Comparisons

To illustrate the visual quality of MLCANet on non-homogeneous image dehazing more intuitively, we conduct a qualitative analysis that compares it with other classical dehazing models from a visual perception perspective. The results of these comparisons across five different datasets are shown in Figure 4.

From the visual comparisons, it is clear that prior-based dehazing methods struggle to accurately estimate haze thickness in different regions of the image, which results in the failure to effectively remove the interference of non-homogeneous haze clusters. This often leads to severe color distortion in the restored images. While data-driven dehazing methods show improvements in image brightness and contrast, they still fail to fully dehaze images. For instance, AOD-Net alleviates color distortion issues but introduces noticeable artifacts due to the non-homogeneous haze distribution. FFA-Net recovers much of the detail texture information across various regions, but color distortion still occurs, especially at object edges and in distant targets. Dehazing models that combine CNN and Transformer architectures manage to reduce some haze interference but often compromise on the preservation of fine texture details in the image.

In contrast, the proposed MLCANet effectively removes the interference caused by non-homogeneously distributed haze and shows strong suppression of uniform thin haze. Figure 5 demonstrates the dehazing effect in real-world foggy scenes. Compared to other architectures, MLCANet generalizes better in real-world scene dehazing tasks. It enhances the overall quality of dehazed images while preserving detailed image features and deep semantic information, ultimately providing superior visual results.

4.4. User Subjective Comparison

To further assess the perceptual quality of the proposed image dehazing algorithm, we conducted a human visual survey. We randomly selected 50 images with varying haze concentrations and used seven classical image dehazing algorithms, including the proposed MLCANet, to dehaze these images. A group of 10 testers was invited to rate the 400 dehazed images on a scale of 1 to 10, without knowing which dehazing method produced each image. The statistical analysis of the ratings for each dehazing method is presented in Table 3.

As shown in the table, the proposed non-homogeneous haze dehazing algorithm consistently achieved the highest ratings in terms of human visual perception, demonstrating its superiority over other methods in restoring clear and realistic images. It should be clarified that the method denoted as “(n) DeHamer” in Figure 4 corresponds to “DeHamer” in Table 3. While its visual dehazing is less satisfactory, it still receives relatively high perceptual scores, possibly due to its ability to preserve natural color and texture despite incomplete haze removal.

4.5. Ablation Study

In this section, we conduct a series of ablation experiments to analyze the impact of various key components and corresponding parameters in the MLCANet dehazing network. The experiments are performed on several datasets, including O-HAZE, NTIRE2019, NTIRE2020, NTIRE2021, and NTIRE2023, with both qualitative and quantitative evaluations.

Table 4 presents the results of the ablation study on the core components of the MLCANet model. We use removal and replacement techniques to assess the contribution of each functional module, including the Multi-level Composite Attention Generation Network (MCAGN), the Dehazed Image Reconstruction Network (DIRN), and the combined loss function, in the image dehazing process. A total of 15 different ablation configurations are tested.

As shown in Table 4, models (a) and (d) highlight that the MCAGN plays a crucial role in effectively capturing and processing haze regions of varying densities. The presence of MCAGN significantly improves the performance of non-homogeneous image dehazing, providing a clear advantage in handling haze with differing intensities across the image.

In model (d), replacing the DIRN with a traditional encoder-decoder structure (ED) leads to a decrease in PSNR and SSIM values, underscoring the importance of the encoder-decoder structure that integrates multi-scale dilated convolutions and deformable convolutions. This combination helps preserve detailed texture information during the dehazing process, thus enhancing the effectiveness of the image reconstruction module.

To further validate the effectiveness of the proposed serial attention design, we construct a parallel variant of MCAGN (denoted as Parallel-MCAGN). In this variant, the Channel Attention (CA), Spatial Attention (SA), and Multi-Scale Pixel Attention (MSPA) modules are applied in parallel to the input features, and their output attention maps are concatenated along the channel dimension before being fed into the subsequent DIRN. All other experimental settings, including training data, loss functions, and optimization strategies, remain identical to those of the serial MCAGN.

As shown in Table 4 (row k), Parallel-MCAGN achieves a PSNR of 13.97 dB and an SSIM of 0.4763, falling short of the serial design. This result indicates that the performance gain of our method does not come merely from stacking multiple attention modules, but from the carefully designed hierarchical refinement order. The serial architecture enables progressive feature refinement from channel-level semantics to pixel-level details, which is essential for handling non-homogeneous haze distributions.
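The structural difference between the serial and parallel arrangements can be illustrated with a toy sketch. Everything here is a simplification assumed for illustration: features are plain nested lists (channels × positions), the three gating functions are simple stand-ins for CA, SA, and MSPA rather than the paper's modules, and the parallel branch concatenates the gated features along the channel axis.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(feat):
    """Scale each channel by a gate computed from its mean activation."""
    return [[sigmoid(sum(ch) / len(ch)) * v for v in ch] for ch in feat]


def spatial_attention(feat):
    """Scale each position by a gate computed from the mean across channels."""
    n_ch = len(feat)
    gates = [sigmoid(sum(ch[i] for ch in feat) / n_ch)
             for i in range(len(feat[0]))]
    return [[g * v for g, v in zip(gates, ch)] for ch in feat]


def pixel_attention(feat):
    """Gate every element by its own activation (toy stand-in for MSPA)."""
    return [[sigmoid(v) * v for v in ch] for ch in feat]


MODULES = (channel_attention, spatial_attention, pixel_attention)


def serial(feat, modules=MODULES):
    """Each module refines the output of the previous one."""
    for module in modules:
        feat = module(feat)
    return feat


def parallel(feat, modules=MODULES):
    """Each module sees the raw input; outputs are concatenated by channel."""
    out = []
    for module in modules:
        out.extend(module(feat))
    return out


x = [[0.2, 0.8, 0.5, 0.1], [1.0, 0.3, 0.7, 0.9]]  # 2 channels, 4 positions
print(len(serial(x)), len(parallel(x)))  # channel counts of the two outputs
```

In the serial form the channel count is preserved and later modules operate on already-refined features, whereas the parallel form triples the channel count and leaves any interaction between the attention types to the subsequent network.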

Models (e) through (j) test different attention modules within the MCAGN to validate their impact on feature extraction for non-homogeneous image dehazing. These models demonstrate that the integration of various attention mechanisms enables the DIRN to better focus on regions with non-homogeneous haze distributions, resulting in significant improvements in both PSNR and SSIM scores for the dehazed images.

To further evaluate the role of the combined loss function, models (l) to (o) compare objective evaluation metrics across different configurations. The results show that incorporating perceptual loss, multi-scale structural similarity loss, and adversarial loss progressively enhances the overall performance of the MLCANet dehazing network, confirming that the combined loss function contributes to better image restoration.

4.6. Impact on Downstream Vision Tasks

The quality of input images is crucial for the performance of downstream computer vision tasks, including object detection and depth estimation. In foggy traffic environments, for example, visual sensors—especially those used in autonomous vehicles—are highly sensitive to the effects of non-homogeneous haze distribution. This haze can cause varying degrees of obstruction in the road environment captured by in-vehicle cameras, which severely impacts the accuracy of environmental perception. To validate the effectiveness and scalability of the proposed MLCANet dehazing network, we evaluate its performance in the context of downstream tasks like object detection and depth estimation.

4.6.1. Object Detection

We employ the YOLOv7 [44] object detection network to identify objects in dehazed images generated by different dehazing algorithms. As shown in Figure 6, detection accuracy improves significantly after dehazing. Specifically, MLCANet notably reduces both missed detections and false detections, leading to more accurate and reliable object recognition, especially for distant and occluded targets. MLCANet outperforms the other dehazing methods in object detection tasks because it better preserves detailed texture information, which is crucial for accurate feature extraction in detection models.

4.6.2. Depth Estimation

For depth estimation in foggy environments, we perform dehazing operations on real-world scene images using various dehazing methods, followed by depth estimation using the Deep Analyzing algorithm [45]. Figure 7 presents a comparison of depth estimation results for different dehazing algorithms. The results clearly show that effective dehazing enhances the accuracy of depth estimation tasks, particularly in mitigating depth information errors caused by haze obstruction. Among the tested dehazing methods, MLCANet stands out by effectively suppressing the interference of non-homogeneous haze, providing more reliable feature information for depth estimation. As a result, MLCANet outperforms other dehazing methods in terms of depth accuracy, demonstrating its superior performance in preserving the integrity of depth cues in challenging environments.

4.7. Discussion

4.7.1. Running Time and Complexity Comparison

The efficiency of image dehazing is crucial for real-time applications in advanced computer vision tasks. To evaluate the time each dehazing algorithm needs to process a single image, we tested all dehazing algorithms on the same GPU platform, a GeForce RTX 2080 Ti with 11 GB of memory. The average dehazing time in non-homogeneous haze scenarios is shown in Table 5.

As observed in the table, prior-based methods (which rely on mathematical statistics) require multiple iterative calculations during the dehazing process, leading to longer processing times. In contrast, image dehazing algorithms based on Convolutional Neural Networks (CNNs) and Transformers benefit from GPU acceleration and therefore run faster than traditional prior-based methods. Among deep learning-based dehazing methods, MLCANet is not the most time-efficient because of the complexity introduced by its Multi-level Composite Attention Generation Network (MCAGN) and Dehazed Image Reconstruction Network (DIRN). However, despite its slightly higher running time, MLCANet delivers the best performance for non-homogeneous image dehazing, especially in complex scenarios where haze distribution is inconsistent.

Table 5 also compares the network parameters and floating-point operations (FLOPs) of various dehazing algorithms applied to non-homogeneous haze images with a resolution of 1200 × 1600 pixels, counting all network components (attention modules, convolutions, activations). The results show that, despite having more parameters because of its attention mechanisms, MLCANet keeps its FLOPs count comparatively low, striking an effective balance between performance and computational efficiency. The apparent disproportion between MLCANet’s parameters (2.73 M) and FLOPs (278.6 G) and those of GridDehazeNet (0.958 M parameters, 271.9 G FLOPs), roughly 2.85 times the parameters for about 2.5% more FLOPs, is explained by architectural differences: GridDehazeNet uses grid-like attention that is computationally intensive (high FLOPs) but parameter-efficient. MLCANet’s serial attention modules introduce additional parameters (channel-wise transformations) but are computationally efficient because attention weights are computed once per stage and reused. The multi-scale dilated convolutions increase FLOPs only marginally despite adding parameters, as they operate on downsampled feature maps.
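The decoupling of parameters from FLOPs can be made concrete with a back-of-the-envelope sketch for a single convolution layer. The layer sizes below (64 channels, 3 × 3 kernels, a 4× downsampling factor) are hypothetical and chosen only for illustration; the two ratios at the end use the figures quoted above.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution (independent of image size)."""
    return c_in * c_out * k * k + c_out


def conv_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate operations for one forward pass (bias ignored)."""
    return c_in * c_out * k * k * h_out * w_out


# Same layer applied at full resolution and on a 4x-downsampled feature map.
full = conv_macs(64, 64, 3, 1200, 1600)
quarter = conv_macs(64, 64, 3, 300, 400)

# Ratios computed from the parameter and FLOPs figures reported in the text.
param_ratio = 2.73 / 0.958   # MLCANet vs. GridDehazeNet parameters
flop_ratio = 278.6 / 271.9   # MLCANet vs. GridDehazeNet FLOPs

print(conv_params(64, 64, 3), full // quarter)
print(round(param_ratio, 2), round(flop_ratio, 3))
```

The parameter count of a convolution does not depend on the spatial size of its input, while its compute scales with the number of output positions, so a layer placed after 4× downsampling costs 16 times fewer operations for the same number of parameters. Dilation changes neither quantity for a fixed output size.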

4.7.2. Limitation

Although the proposed non-homogeneous image dehazing network excels in visual quality, evaluation metrics (both subjective and objective), efficiency, and robustness in enhancing visual target detection under foggy conditions, it has two main limitations in its architectural design:

(1) Serial Operation in the MCAGN: The Multi-level Composite Attention Generation Network (MCAGN) employs a serial operation mode for its three attention modules. While this serial structure enables the network to explore the potential relationships between local and global features, it also introduces the risk of error accumulation. Ineffective features in the earlier stages of the network could be propagated and amplified in subsequent stages. This could result in incomplete dehazing, particularly when handling dense fog in real-world environments. Ideally, parallel processing or more sophisticated error handling mechanisms could mitigate this limitation.

(2) Limited Training Data and Domain Adaptation: The model’s performance is highly dependent on the paired datasets used for training. The limited number of available paired datasets restricts the network’s ability to generalize to new or unseen haze scenarios, particularly when haze conditions differ from those in the training data. This limitation poses challenges in adapting to domain shifts, where the characteristics of the haze or scene might vary. More diverse and expansive datasets, as well as domain adaptation techniques, could help overcome this issue and improve the robustness of the dehazing model across different environmental conditions.

Additionally, the proposed Y-channel guided attention assumes a consistent correlation between luminance and haze density, which may be less reliable under extreme illumination variations (e.g., nighttime or low-light scenes). Exploring adaptive or illumination-aware guidance mechanisms is an important direction for future work.

5. Conclusions

In this paper, we proposed a novel Multi-Level Composite Attention-Guided Dehazing Network (MLCANet) for non-homogeneous haze image dehazing. The network leverages multi-level attention modules to effectively capture and process feature information across non-homogeneous haze regions, while also preserving the underlying image semantics. To further enhance dehazing performance, we designed an efficient Dehazed Image Reconstruction Network (DIRN) that combines multi-scale dilated convolutions and deformable convolutions. This combination increases the model’s adaptability to complex environments by expanding the receptive field, allowing it to better restore detail and texture information, particularly in regions affected by dense haze.

We performed extensive qualitative and quantitative experiments, along with ablation studies, to verify the effectiveness of each component of MLCANet. Furthermore, we showcased its applicability to advanced vision tasks, such as object detection and depth estimation, under foggy conditions. The results demonstrate that the MLCANet dehazing network effectively eliminates non-homogeneous haze interference, enhances the visual quality of dehazed images, and accurately recovers fine details and textures, particularly in dense haze regions. In future work, we aim to incorporate MLCANet into high-level vision tasks, such as object detection and instance segmentation, to boost performance in challenging, foggy conditions. This integration seeks to enhance the accuracy and robustness of vision systems, especially in scenarios with non-homogeneous haze distributions. Beyond its empirical performance, MLCANet also provides a conceptually interpretable design paradigm for non-homogeneous degradation: hierarchical attention as a mechanism for spatially adaptive feature refinement. We believe this insight contributes a new perspective to the broader image restoration community. Furthermore, given that adverse weather encompasses various phenomena beyond haze, extending MLCANet to handle rain, snow, and low-light conditions through multi-task learning or domain adaptation is an interesting direction for future research.

Bibliography

1. Wang H., Xu Y., Wang Z., Cai Y., Chen L., Li Y. CenterNet-Auto: A Multi-object Visual Detection Algorithm for Autonomous Driving Scenes Based on Improved CenterNet. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 742–752. doi:10.1109/TETCI.2023.3235381
2. Sindagi V.A., Oza P., Yasarla R., Patel V.M. Prior-based domain adaptive object detection for hazy and rainy conditions. Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XIV 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 763–780.
3. Sun H., Li B., Dan Z., Hu W., Du B., Yang W., Wan J. Multi-level feature interaction and efficient non-local information enhanced channel attention for image dehazing. Neural Netw. 2023, 163, 10–27. doi:10.1016/j.neunet.2023.03.017. PMID: 37011517
4. Jiang S., Mei Y., Wang P., Liu Q. Exposure difference network for low-light image enhancement. Pattern Recognit. 2024, 156, 110796. doi:10.1016/j.patcog.2024.110796
5. Middleton W.E.K. Vision through the atmosphere. Geophysik II/Geophysics II; Springer: Berlin/Heidelberg, Germany, 1957; pp. 254–287.
6. Ju M., Gu Z., Zhang D. Single image haze removal based on the improved atmospheric scattering model. Neurocomputing 2017, 260, 180–191. doi:10.1016/j.neucom.2017.04.034
7. Wu H., Qu Y., Lin S., Zhou J., Qiao R., Zhang Z., Xie Y., Ma L. Contrastive learning for compact single image dehazing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10551–10560.
8. Shetty L. Non homogeneous realistic single image dehazing. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 548–555.