A Multi-Scale Global Fusion-Based Method for Surface Fissure Extraction from UAV Imagery

Mingxi Zhou; Min Ji; Fengxiang Jin; Zhaomin Zhang; Fengke Dou; Xiangru Fan

PMC · DOI:10.3390/s26051440 · February 25, 2026

Open Access

TL;DR

This paper introduces MGF-UNet, a new method for accurately extracting surface fissures from high-resolution UAV imagery to aid in geohazard monitoring.

Contribution

The novel MGF-UNet integrates multi-scale feature sensing and a transformer-based module for improved fissure detection in complex terrains.

Findings

(1) MGF-UNet achieves 78.2% accuracy, 81.4% Dice score, and 68.6% IoU on fissure detection benchmarks.

(2) The method outperforms existing networks in capturing elongated fissures and preserving structural details.

(3) It demonstrates effectiveness in deformation-prone environments for geohazard monitoring and ecological restoration.

Abstract

The prevalence of ground fissures in deformation-affected areas has intensified, presenting serious risks to both operational safety and the local natural environment. Fissures in these disturbed terrains are typically characterized by elongated morphologies and large variations in scale, which pose substantial challenges to accurate feature extraction. To address these complexities, this paper proposes a semantic segmentation network termed MGF-UNet. In the shallow layers, we integrate multi-scale feature sensing (MFS) and grouped efficient multi-scale attention (EMA) to sharpen anisotropic textures and boundary details under high-resolution representations. For the deeper layers, a Token-Selective Context Transformer (TSCT) is designed to perform selective global modeling on high-level semantic features, effectively capturing long-range dependencies while preserving the structural…

Funding

Shandong Province Key R&D Program (Competitive Innovation Platform) Project

Keywords

surface fissures; semantic segmentation; deep learning; multi-scale features; UAV remote sensing

Taxonomy

Topics: Geophysical Methods and Applications · Rock Mechanics and Modeling · 3D Surveying and Cultural Heritage

Full text

1. Introduction

High-intensity and large-scale underground coal mining activities can induce frequent subsurface rock strata movement and goaf collapse, which subsequently result in the destruction of surface soil–rock structures and the formation of geological hazards such as ground subsidence and surface fissures [1]. As a typical type of surface geological hazard, ground fissures not only cause damage to buildings and farmland, but may also trigger a series of cascading effects, including landslides, debris flows, and ecological degradation, thereby threatening mining safety and the security of local residents [2]. Therefore, achieving efficient and accurate extraction of surface fissures is of great significance for disaster prevention and mitigation, natural environment restoration, and safe production in deformation-affected areas [3].

At present, the identification of ground fissures in mining areas mainly relies on traditional approaches such as field surveys and manual inspections, which are not only labor-intensive and inefficient but also pose considerable safety risks to personnel in complex mining environments [4]. With the rapid development of remote sensing technologies, satellite-based observations have gradually attracted increasing attention. For example, Zou et al. [5] employed deep learning methods to extract large-scale roads from remote sensing imagery. However, due to limitations in spatial resolution, satellite remote sensing is generally inadequate for capturing the fine structural characteristics of ground fissures [6]. In contrast, low-altitude unmanned aerial vehicle (UAV) remote sensing offers distinctive advantages, characterized by high spatial resolution, operational flexibility, and cost-effectiveness [7]. To date, UAV-based surveys have become a cornerstone in various geological and environmental disciplines. Specifically, they facilitate landslide monitoring and mapping within applied geology [8], enable detailed surface rupture and fault characterization in earthquake studies [9], support the structural analysis of eruptive fissures and fracture systems in volcanology [10], and underpin structural damage assessments in geoengineering [11]. Such integrated capabilities substantially refine the precision and responsiveness of fissure detection, providing a robust data foundation for fine-scale geological hazard identification. In the interpretation of high-resolution UAV imagery, several studies have attempted to employ traditional machine learning and image processing methods, such as support vector machines, random forests, and edge detection algorithms, to achieve automated fissure extraction [12,13]. 
Nevertheless, these methods lack the capability to model high-level semantic information and exhibit limited robustness under complex background conditions, often leading to missed detections and false positives in the presence of fissure discontinuities, surface clutter, and environmental interference.

With the rapid development of artificial intelligence and computer vision technologies, deep learning has gradually become the dominant paradigm for crack identification tasks [14]. Some studies have applied YOLO-series models to crack extraction. For instance, Liu et al. [15] employed an improved YOLOv5 model for crack detection in concrete bridges, while An et al. [16] proposed a lightweight small-object detection model, YOLO-LSN, to detect surface fissures in mining areas. Although such detection-based methods exhibit advantages in computational efficiency and real-time performance, their reliance on bounding boxes makes it difficult to accurately delineate fine crack boundaries, resulting in inherent limitations for pixel-level measurement and analysis. In contrast, convolutional neural networks (CNNs) have demonstrated strong capabilities in automatic feature learning for semantic segmentation of remote sensing imagery, enabling end-to-end extraction of spatial distributions, texture patterns, and edge characteristics of cracks from raw images [17]. In a recent study from 2026, Zhang et al. [18] developed SCAFNet, a hybrid CNN-Transformer architecture for remote sensing change detection. By synergizing the local feature extraction capabilities of CNNs with the global context modeling of Transformers, this approach effectively addresses the inherent limitations of using either architecture in isolation. Zhang et al. [19] developed PISENet, which utilizes a parallel multi-branch structure to capture contextual information ranging from local details to long-range semantics, thereby strengthening context modeling and mitigating semantic gaps across multi-scale features. In the field of pavement and wall crack detection, Zou et al. [20] proposed DeepCrack, which integrates fully convolutional networks with deep supervision to achieve multi-level feature fusion, while Mohamad et al. [21] introduced CrackPix, significantly enhancing the capture of fine-grained crack details through a deep fully convolutional architecture.

For mining-area surface fissure detection based on UAV imagery, Zhu et al. [22] incorporated adaptive rectangular convolution into the ResUNet framework to dynamically adjust convolutional kernel shapes for fissures with varying scales and orientations. Hu et al. [23] proposed OrientFuse-Net, which employs orientation-sensitive convolutions to strengthen the network’s response to multi-directional fissure textures. Jiang et al. [24] developed MFPA-Net, enhancing scale adaptability through a multi-scale feature pyramid architecture. Chen et al. [25] proposed GFSegNet, which integrates an efficient sub-attention Transformer into the U-Net encoder to achieve superior segmentation performance in complex mining scenarios. Tao et al. [26] introduced DRs-U-Net, which integrates ResNet, U-Net, and a soft-hold mechanism to enable the end-to-end segmentation of surface fissures in deformation-affected areas. Subsequently, GF-former [27] incorporated an adaptive feature fusion strategy to strike a balance between semantic information and boundary details. Collectively, these recent advancements emphasize spatial-domain structural priors and convolution-based feature enhancement.

Despite the significant progress of these methods, they still face severe challenges in complex surface environments. On the one hand, shadows in mining areas exhibit spatial characteristics extremely similar to those of genuine fissures. On the other hand, existing convolutional architectures struggle to isolate weak fissure signals from complex spatial backgrounds, whereas Transformer architectures suffer from attention dilution due to the extreme sparsity of fissures [28]. Consequently, there is a need for a new method capable of both suppressing background interference and enhancing fissure saliency. Research indicates that CNNs provide more stable representations for anisotropic and narrow structures [29]. Therefore, this study adopts a CNN-based architecture and introduces a selective global context modeling mechanism at the deep stage to suppress background token interference. Furthermore, a Fourier transform-based adaptive feature fusion strategy is incorporated in the decoder to reduce background noise and enhance boundary contrast. Through this design, the proposed model achieves a favorable balance between global semantic understanding and local detail restoration.

In summary, the main contributions of this paper are as follows:

(1) A mining-area surface fissure semantic segmentation network, termed MGF-UNet, is proposed based on an encoder–decoder framework. By integrating multi-scale feature perception, efficient multi-scale attention, a token selection mechanism, feature-wise linear modulation (FiLM), and adaptive feature fusion, the network achieves an effective balance between global context modeling and local detail representation.

(2) Two crack segmentation datasets are constructed to provide data support for cross-scenario fissure segmentation research: a mining-area surface fissure dataset (MFD), generated through UAV-based high-resolution orthophoto acquisition, cropping, and annotation, and a road surface fissure dataset (RFD), collected via in situ camera imaging.

(3) The proposed method is systematically evaluated on the MFD dataset and compared with multiple mainstream semantic segmentation networks. The results demonstrate that MGF-UNet achieves significant improvements in accuracy, IoU, and F1-score; meanwhile, experiments on the RFD dataset further verify the model’s generalization capability and robustness across different scenarios.

(4) Ablation experiments are conducted to validate the effectiveness of each core module (MFS, EMA, TSCT, FiLM, and AFF), and the results indicate that these designs improve fissure extraction accuracy and fine-grained detail representation to varying degrees.

2. Study Area and Dataset

2.1. Study Area

The coal mine selected as the study area in this research is located in the southern part of Changzi County, Shanxi Province, approximately 8 km from the county center, with the mining area situated at about 112° E longitude and 36° N latitude. The mining field extends approximately 6.6 km from north to south and 7.5 km from east to west, covering a total area of about 40 km². The ground surface overlying the working faces is dominated by hilly terrain, and the overall topography is characterized by a gradual decrease in elevation from east to west and from south to north. The study area is located in the southeastern part of the Qinshui Basin, where coal-bearing strata are well developed. These coal measures mainly belong to the Carboniferous–Permian system, and the overlying strata of the coal seams are composed predominantly of sandstone, mudstone, and sandy mudstone, which is consistent with the typical stratigraphic framework of coal-bearing sequences in the Qinshui Basin [30,31,32]. Under the influence of underground coal extraction, the study area has undergone intense surface deformation since mining began. Surface deformation dominated by tensile extension commonly develops along the boundaries of the working faces and in the surrounding areas, which is widely recognized as a typical surface manifestation of overburden movement and ground response induced by underground mining [33]. Field investigations were conducted between June and July, during which it was observed that fissures in densely affected areas exhibit spacings of approximately 10–20 m, with a maximum fissure width of about 50 mm. Figure 1a illustrates the geographical location of the study area, Figure 1b presents the Digital Elevation Model of the study area, Figure 1c shows residential building damage caused by mining activities, Figure 1d depicts the surface morphology of the mining area, and Figure 1e illustrates farmland damage induced by ground fissures.

2.2. Dataset

High-resolution remote sensing imagery of the mining-area surface was acquired using a DJI M300 RTK UAV (SZ DJI Technology Co., Ltd., Shenzhen, China). The flight altitude was approximately 115 m, corresponding to a ground spatial resolution of about 2.5 cm. Image acquisition was conducted from June to July 2025, during which the study area was in the summer season, with corn vegetation generally reaching heights of approximately 30–50 cm. The data collection process was carried out strictly following predefined flight routes and overlap parameters to ensure complete coverage of the main surface fissure development areas within the study region. The raw images were processed using ContextCapture software (Version 23.00; Bentley Systems, Exton, PA, USA; https://www.bentley.com/software/contextcapture, accessed on 15 January 2026) for orthorectification and image mosaicking, resulting in high-precision digital orthophoto maps (DOMs).

During the data preparation stage, a sliding-window approach was applied to crop the DOM images, with the window stride set to 128 pixels, and image patches containing surface fissure targets were selected. Pixel-level fine-grained annotations were then performed using the Labelme platform. In the annotation process, surface fissures were defined as structures exhibiting clear linear morphology, geometric continuity, and significant texture or gray-scale contrast with the surrounding background. Fissure pixels were labeled with a value of 255, while all other regions were assigned a value of 0. For areas with blurred fissure boundaries or local occlusions, annotation prioritized overall fissure continuity and contextual consistency. Examples of the annotation results are shown in Figure 2.
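The cropping and patch-selection step described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: only the 128-pixel stride and the 255/0 label convention come from the text, while the 256-pixel window size and the function names `window_origins` and `fissure_patches` are assumptions made for the example.

```python
import numpy as np


def window_origins(height, width, window=256, stride=128):
    """Top-left corners of every full window that fits inside the image."""
    rows = range(0, height - window + 1, stride)
    cols = range(0, width - window + 1, stride)
    return [(r, c) for r in rows for c in cols]


def fissure_patches(image, mask, window=256, stride=128):
    """Yield (image_patch, binary_mask_patch) pairs that contain fissure pixels.

    `mask` follows the annotation convention in the text: 255 = fissure, 0 = background.
    """
    for r, c in window_origins(mask.shape[0], mask.shape[1], window, stride):
        mask_patch = mask[r:r + window, c:c + window]
        if (mask_patch == 255).any():  # keep only patches with a fissure target
            binary = (mask_patch == 255).astype(np.uint8)
            yield image[r:r + window, c:c + window], binary


# Toy example: a 512 x 512 orthophoto tile with one short synthetic fissure.
image = np.zeros((512, 512, 3), dtype=np.uint8)
mask = np.zeros((512, 512), dtype=np.uint8)
mask[40:60, 100] = 255  # vertical segment that lies in the top-left window only

origins = window_origins(512, 512)
kept = list(fissure_patches(image, mask))
print(len(origins), len(kept))
```

With a stride equal to half the assumed window size, neighbouring patches overlap by 50%, so a fissure cut by one window border is usually seen whole in an adjacent patch. Windows that do not fit entirely inside the image are skipped in this sketch; a real pipeline would need to pad or shift the last window.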

After data augmentation operations including Gaussian blurring, random flipping, and affine transformations, a surface fissure semantic segmentation dataset containing 2072 image samples with a resolution of 256 × 256 pixels was constructed and divided into training, validation, and test sets at a ratio of 8:1:1. In addition, to further evaluate the generalization capability of the proposed method under different imaging conditions and application scenarios, a high-resolution road fissure dataset (RFD) with image sizes of 512 × 512 pixels was also built using images captured by mobile devices. This dataset features a gray–white background with a large gray-scale contrast between the background and fissure targets, and representative examples are shown in Figure 2.
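As a small illustration of the 8:1:1 division, the sketch below shuffles sample indices and splits them by integer ratios. The seed, the rounding rule (remainder assigned to the test set), and the function name `split_indices` are assumptions of the example; the text states only the ratio and the total of 2072 samples.

```python
import random


def split_indices(n_samples, ratios=(8, 1, 1), seed=0):
    """Shuffle sample indices and split them by integer ratios (train, val, test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_val = n_samples * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(2072)
print(len(train), len(val), len(test))
```

Under these assumptions the 2072 samples divide into 1657, 207, and 208 images; a different rounding rule would shift one or two samples between subsets.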

3. Methods

3.1. MGF-UNet Architecture

As illustrated in Figure 3, the MGF-UNet architecture adopts a hierarchical encoder–decoder layout specifically tailored for the fine-grained segmentation of surface fissures. Given an input UAV image $[eqn]$, the encoder applies four downsampling stages with channel dimensions of 64, 128, 256, and 512, respectively. These stages progressively reduce the spatial resolution to 1/4, 1/8, 1/16, and 1/32 of the original input, enabling a systematic transition from local geometric feature extraction to high-level global semantic representation.
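To make the stage configuration concrete, the following sketch lists the feature-map shape at each encoder stage for a given input size. The helper name `encoder_feature_shapes` is hypothetical; the channel widths and downsampling factors are those stated above.

```python
def encoder_feature_shapes(height, width,
                           channels=(64, 128, 256, 512),
                           strides=(4, 8, 16, 32)):
    """Return (channels, height, width) for each of the four encoder stages."""
    return [(c, height // s, width // s) for c, s in zip(channels, strides)]


shapes = encoder_feature_shapes(256, 256)
for stage, shape in enumerate(shapes, start=1):
    print(f"stage {stage}: {shape}")
```

For a 256 × 256 patch the deepest stage is an 8 × 8 map with 512 channels, that is, 64 spatial positions. If the global context module operates on this last stage (an assumption here), it attends over only 64 tokens, which keeps attention inexpensive.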

In the shallow encoding stage, the network focuses on fine-grained modeling of local fissure morphology and boundary structures. To this end, a multi-scale feature sensing module is introduced, in which asymmetric convolutions and dilated convolutions are jointly employed to expand the effective receptive field, thereby enhancing the network’s responsiveness to fissure structures at different scales. Meanwhile, the efficient multi-scale attention (EMA) mechanism [34] is incorporated to strengthen cross-spatial interactions between fissure regions and the background, improving the consistency and discriminability of fissure-related features.

As the feature resolution is further reduced, the network emphasizes global context modeling in the deep encoding stage. At this stage, a Token-Selective Context Transformer is introduced to explicitly select key tokens that are highly relevant to fissure regions, thereby effectively suppressing the interference of background areas during attention computation and enhancing the stability of global semantic modeling. In addition, a feature-wise linear modulation (FiLM) module [35] is employed to generate a set of channel-wise affine modulation parameters $[eqn]$ from shallow-level local structural features, which are used to modulate deep semantic representations. This design alleviates the over-smoothing effect introduced by downsampling and enables collaborative fusion of local information and global semantic representations.

In the decoding stage, the network employs a Fourier transform-based adaptive feature fusion module to integrate upsampled features with skip-connected features. Feature enhancement is performed in the frequency domain to suppress artifacts and background noise while strengthening the high-frequency responses associated with fissure boundaries. Based on this architectural design, MGF-UNet achieves stable and fine-grained segmentation of mining-area surface fissures by effectively balancing global context modeling capability with local structural preservation.

3.2. MFS (Multi-Scale Feature Sensing)

Ground fissures typically exhibit elongated morphologies with large variations in scale and are often accompanied by blurred boundaries. Relying solely on single-scale convolutions can easily lead to the loss of fine fissure details or incomplete boundary delineation. To address this issue, a multi-scale feature sensing (MFS) module is introduced, as illustrated in Figure 4. Within this module, asymmetric convolutions (1 × 5 and 5 × 1) and dilated convolutions (with a dilation rate of 3) are jointly employed to enhance linear texture responses of fissures while expanding the effective receptive field. In addition, depth-wise separable convolution (DSConv) [36] is incorporated via residual connections to further strengthen edge extraction capability. Compared with deeper network layers, shallow layers preserve higher spatial resolution and are therefore more suitable for capturing local information [37]. Accordingly, the MFS module is deployed in the first two encoding stages, enabling effective extraction of fine fissure features and complex spatial structures from high-resolution feature maps.

[eqn]

[eqn]

[eqn]
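The effect of the kernel choices in MFS can be checked with simple arithmetic. The sketch below is illustrative only: the dilated kernel is assumed to be 3 × 3 (the text gives only the dilation rate of 3), the comparison with a dense 5 × 5 kernel assumes the two asymmetric convolutions are stacked in sequence, and both helper names are hypothetical.

```python
def effective_kernel(kernel_size, dilation):
    """Effective extent of a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def weights_per_channel_pair(kernel_shapes):
    """Number of weights per (input channel, output channel) pair, biases excluded."""
    return sum(h * w for h, w in kernel_shapes)


# A 1x5 convolution followed by a 5x1 convolution spans a 5x5 neighbourhood ...
asymmetric = weights_per_channel_pair([(1, 5), (5, 1)])
# ... with fewer weights than a single dense 5x5 kernel.
dense = weights_per_channel_pair([(5, 5)])

# Assumed 3x3 kernel combined with the dilation rate of 3 stated in the text.
dilated_extent = effective_kernel(kernel_size=3, dilation=3)

print(asymmetric, dense, dilated_extent)
```

Under these assumptions the asymmetric pair uses 10 weights where a dense 5 × 5 kernel uses 25, and the dilated 3 × 3 kernel spans 7 pixels per axis with only 9 weights. Both choices enlarge the receptive field along the fissure direction at low parameter cost.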

3.3. EMA (Efficient Multi-Scale Attention)

Although multi-scale convolutions can partially mitigate information loss, they do not explicitly model the importance of feature channels and spatial locations. In ground-fissure semantic segmentation, features from different channels and positions contribute unequally to fissure semantics. Conventional channel-attention mechanisms can model cross-channel dependencies and highlight salient cues, but they often incur feature loss and lack dynamic regulation; consequently, critical edge and connectivity information is easily diluted by background noise, degrading discrimination in complex scenes.

To address these issues, we introduce efficient multi-scale attention (EMA), which combines channel grouping with cross-spatial interaction to achieve lightweight multi-scale enhancement. As illustrated in Figure 5, EMA partitions the input channels into several subgroups and models them independently to reduce computational cost while preserving multi-scale expressiveness. Adaptive pooling along the horizontal and vertical directions, followed by $[eqn]$ convolutions, captures cross-spatial dependencies, thereby reinforcing the directional coherence of strip-like fissures and improving boundary discrimination. Finally, a parallel interaction pathway fuses global context with local convolutional details, enabling effective feature integration and multi-scale aggregation.

Specifically, for an input feature map $[eqn]$ , EMA partitions its channels into G groups, each of size C/G, yielding the grouped feature $[eqn]$ ; EMA then performs horizontal and vertical pooling on each grouped feature $[eqn]$ to extract transverse and longitudinal contextual information of strip-like fissures:

[eqn]

The two pooled maps are fused by a $[eqn]$ convolution and mapped to $[eqn]$ to obtain the spatial attention weights $[eqn]$ , which are then used to modulate the grouped features at the pixel level:

[eqn]

Meanwhile, a $[eqn]$ convolution is applied to $[eqn]$ to extract local boundary and high-frequency features $[eqn]$ :

[eqn]

To further capture cross-spatial dependencies, EMA applies global average pooling and normalization to $[eqn]$ and $[eqn]$ to obtain attention weights; these are then interacted to produce the final cross-scale feature attention:

[eqn]

Finally, the learned weights are fused with the grouped features to produce the group-wise output $[eqn]$ ; concatenating all subgroups along the channel dimension yields the final output Y:

[eqn]

Through channel grouping, EMA substantially reduces the computational burden; via spatial gating and channel interaction, it enhances the multi-scale representation of slender fissures and improves boundary sharpness and connectivity.
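The grouping and directional pooling steps can be sketched in NumPy as follows. This is a simplified illustration and not the EMA module itself: the convolution that fuses the pooled maps is replaced by the identity, the local convolution branch and the cross-spatial interaction are omitted, and the function name `directional_gate` is hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def directional_gate(x, groups):
    """Gate a (C, H, W) feature map with row-wise and column-wise pooled context.

    Channels are split into `groups` subgroups that are processed independently.
    The shared convolution that fuses the pooled maps is omitted here
    (treated as the identity), which is a simplification of this sketch.
    """
    c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by the group count"
    grouped = x.reshape(groups, c // groups, h, w)

    pooled_rows = grouped.mean(axis=3, keepdims=True)  # (G, C/G, H, 1): average over width
    pooled_cols = grouped.mean(axis=2, keepdims=True)  # (G, C/G, 1, W): average over height

    # Broadcasting the two gates yields one weight per pixel, each in (0, 1).
    weights = sigmoid(pooled_rows) * sigmoid(pooled_cols)
    gated = grouped * weights
    return gated.reshape(c, h, w), weights


rng = np.random.default_rng(0)
features = rng.standard_normal((64, 32, 32))
gated, weights = directional_gate(features, groups=8)
print(gated.shape, weights.shape)
```

Because each gate is constant along one axis, a thin structure aligned with a row or column raises the weight along its whole length, which is one way to read the claim about directional coherence of strip-like fissures.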

3.4. TSCT (Token-Selective Context Transformer)

Transformers have exhibited remarkable capability in modeling global contextual information for visual recognition tasks [39]. Nevertheless, the conventional self-attention mechanism assigns equal importance to all tokens, which makes it highly susceptible to interference from a large number of background tokens. As a result, the discriminative representation of weakly salient structures, such as thin and elongated ground fissures, is substantially degraded [40]. To overcome this limitation, this study proposes a Token-Selective Context Transformer (TSCT). As illustrated in Figure 6, TSCT introduces an explicit token selection strategy at the deep, low-resolution feature stage, enabling selective global semantic modeling that emphasizes fissure-relevant tokens while suppressing background-induced attention noise.

Given an input feature representation $[eqn]$ , where $[eqn]$ denotes the number of tokens and C represents the channel dimension, linear projections are first applied to generate the query matrix Q, key matrix K, and value matrix V:

[eqn]

where $[eqn]$ are learnable linear projection parameters and d denotes the dimensionality of the attention subspace. To emphasize key regions that are highly relevant to fissures, TSCT introduces a token selection weighting function, $[eqn]$ , which assigns an importance weight to each token:

[eqn]

$[eqn]$ consists of two fully connected layers followed by a sigmoid activation function, and the resulting weights are used to modulate the query and key representations in a token-wise manner.

[eqn]

where $[eqn]$ denotes a diagonal matrix constructed from the token importance weights, which is used to reweight the query and key matrices. Based on the modulated queries and keys, global contextual features are then computed using the standard scaled dot-product attention mechanism:

[eqn]
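A minimal NumPy sketch of the token-selective attention described above is given below. It follows the textual description (linear projections, a two-layer scoring function with a sigmoid, token-wise reweighting of queries and keys, scaled dot-product attention) under stated assumptions: a single head, a ReLU between the two scoring layers, random placeholder weights, and illustrative dimensions. The function name `token_selective_attention` is hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def token_selective_attention(x, w_q, w_k, w_v, w1, w2):
    """x: (N, C) tokens. Returns (output, token_weights, attention_map)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # (N, d) each

    # Token scoring: two fully connected layers followed by a sigmoid.
    # The ReLU between the layers is an assumption of this sketch.
    hidden = np.maximum(x @ w1, 0.0)               # (N, hidden_dim)
    s = sigmoid(hidden @ w2)                       # (N, 1), one weight per token

    q_mod, k_mod = s * q, s * k                    # same as diag(s) @ Q and diag(s) @ K
    d = q.shape[1]
    attn = softmax(q_mod @ k_mod.T / np.sqrt(d))   # (N, N), each row sums to 1
    return attn @ v, s[:, 0], attn


rng = np.random.default_rng(0)
n_tokens, channels, d, hidden_dim = 64, 512, 64, 128  # illustrative sizes
x = rng.standard_normal((n_tokens, channels))
w_q, w_k, w_v = (rng.standard_normal((channels, d)) * 0.02 for _ in range(3))
w1 = rng.standard_normal((channels, hidden_dim)) * 0.02
w2 = rng.standard_normal((hidden_dim, 1)) * 0.02

out, token_weights, attn = token_selective_attention(x, w_q, w_k, w_v, w1, w2)
print(out.shape, token_weights.shape, attn.shape)
```

In this formulation the attention logit between tokens i and j is scaled by the product of their weights, so pairs involving a low-weight token are pushed toward a zero logit rather than excluded outright. The selection is therefore soft: background tokens contribute less sharply but are not dropped.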

Following the self-attention operation, TSCT further enhances feature representation by employing an improved hybrid feed-forward network. Specifically, a depth-wise convolution (DWConv) is introduced to incorporate spatial awareness, which is combined with squeeze-and-excitation (SE) channel attention [41] for adaptive feature re-calibration. Subsequently, a gated linear unit (GLU) [42] is applied in an expanded feature space to perform effective information filtering. The overall transformation process can be formulated as

[eqn]

where $[eqn]$ denotes the depth-wise convolution operation, $[eqn]$ represents channel attention, $[eqn]$ denotes the gated linear unit, and $[eqn]$ is the output feature. The TSCT module adopts a Pre-Norm architecture, in which Layer Normalization is applied before both the self-attention submodule and the feed-forward submodule, together with residual connections to ensure gradient stability during deep network training.

3.5. FiLM (Feature-Wise Linear Modulation)

Although TSCT is capable of effectively modeling long-range dependencies and integrating global contextual information, the modeling process may cause response homogenization across different spatial locations and semantic channels, leading to an over-smoothing effect that weakens the discriminability of thin and elongated fissure structures. In the deep encoding stage, let the deep feature representation of the encoder be denoted as $[eqn]$ , where $[eqn]$ indexes spatial locations and c indexes feature channels. This feature representation is modulated by the feature-wise linear modulation (FiLM) mechanism to facilitate subsequent global context modeling, as illustrated in Figure 7.

The feature modulation module extracts a conditional representation vector $[eqn]$ from the intermediate and shallow features $[eqn]$ , which serves as a modulation signal to guide the re-calibration of deep feature representations. Specifically, channel-wise feature modulation parameters are first predicted from the conditional features through a learnable mapping function:

[eqn]

where $[eqn]$ and $[eqn]$ denote learnable linear layers that generate the corresponding scaling coefficients and bias terms for each channel, respectively, enabling channel-wise affine modulation of the deep feature representations.

[eqn]

It is worth emphasizing that the modulation parameters $[eqn]$ depend only on the channel index $[eqn]$ and are shared across all spatial locations $[eqn]$ . Through condition-dependent channel-wise re-calibration, the overall distribution of deep features is globally adjusted. By this mechanism, prior information provided by intermediate and shallow features enhances fissure-related channel responses while suppressing the accumulation of irrelevant features during subsequent global context modeling.
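The channel-wise affine modulation can be illustrated with a short NumPy sketch. The conditioning vector is taken here as the global average of a shallow feature map, which is an assumption (the text does not specify how it is extracted); the feature sizes, random weights, and the function name `film` are likewise illustrative.

```python
import numpy as np


def film(deep, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Channel-wise affine modulation of a (C, H, W) deep feature map.

    cond: (D,) conditioning vector; two linear layers map it to C scales and C shifts.
    """
    gamma = cond @ w_gamma + b_gamma   # (C,) scaling coefficients
    beta = cond @ w_beta + b_beta      # (C,) bias terms
    # One (gamma, beta) pair per channel, shared by every spatial location.
    return gamma[:, None, None] * deep + beta[:, None, None]


rng = np.random.default_rng(0)
shallow = rng.standard_normal((64, 64, 64))   # hypothetical shallow feature map
deep = rng.standard_normal((512, 8, 8))       # hypothetical deep feature map

# Assumption: the conditioning vector is the global average of the shallow features.
cond = shallow.mean(axis=(1, 2))              # (64,)

w_gamma = rng.standard_normal((64, 512)) * 0.02
w_beta = rng.standard_normal((64, 512)) * 0.02
modulated = film(deep, cond, w_gamma, np.ones(512), w_beta, np.zeros(512))
print(modulated.shape)
```

Initialising the scale bias to one and the shift bias to zero, as done here, makes the module start close to the identity, a common choice for modulation layers. Since the parameters do not vary with position, FiLM in this form rescales whole channels and cannot by itself sharpen individual pixels.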

3.6. AFF (Adaptive Feature Fusion)

Ground fissure targets are generally characterized by elongated shapes, structural discontinuity, and susceptibility to interference from texture shadows. Conventional cross-layer feature fusion tends to amplify high-frequency background noise from encoder features, leading to false detections, while also causing excessive smoothing of deep features, which degrades boundary sharpness and structural connectivity [42]. To address these issues, a Fourier transform-based adaptive feature fusion (AFF) module is introduced, as illustrated in Figure 8. The AFF module consists of three sequential steps—cross-feature fusion [43], frequency-domain enhancement [44], and spatially refined reconstruction—which jointly improve boundary clarity and fissure connectivity without introducing significant computational overhead.

Given the upsampled feature $[eqn]$ in the decoding stage and the skip-connected feature $[eqn]$ from the encoder, the overall mapping of AFF can be expressed as

[eqn]

Before frequency-domain enhancement, AFF first performs cross filtering of the two feature branches in the spatial domain to suppress background noise introduced by unilateral activations. Specifically, pixel-wise weighting maps are generated via a $[eqn]$ convolution followed by a sigmoid activation, and these weights are used to modulate the features of the opposite branch:

[eqn]

[eqn]

The filtered features are then fused to obtain the spatial-domain fused feature $[eqn]$ :

[eqn]

This cross-filtering mechanism ensures that only spatial locations jointly activated by both feature branches are significantly enhanced, thereby achieving preliminary suppression of background noise in the spatial domain.

After cross-feature fusion, the fused feature $[eqn]$ is subjected to a two-dimensional Fourier transform (FFT), mapping it into the frequency domain to obtain the real part $[eqn]$ and the imaginary part $[eqn]$ of the complex spectrum. These components are then concatenated along the channel dimension to form the frequency-domain feature representation $[eqn]$ :

[eqn]

[eqn]

After the frequency-domain enhancement, $[eqn]$ is split along the channel dimension into updated real and imaginary components. An inverse Fourier transform is then applied to reconstruct the enhanced frequency-domain representation back into the spatial domain, yielding the feature $[eqn]$ :

[eqn]

[eqn]

Finally, to further stabilize the feature representation, a $[eqn]$ convolution–normalization–activation (CBR) block is applied, producing the final AFF output $[eqn]$ .
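The three AFF steps can be sketched end to end in NumPy. Several details are assumptions because the corresponding equations are not available here: the gating convolution is taken as 1 × 1, the two filtered branches are fused by addition, and the frequency-domain enhancement is reduced to a single scale factor standing in for the learned transformation. The final convolution, normalization, and activation block is omitted, and the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def pointwise_conv(x, weight):
    """1x1 convolution on a (C, H, W) map: a per-pixel linear map across channels."""
    return np.einsum("oc,chw->ohw", weight, x)


def aff_sketch(up, skip, w_up, w_skip, freq_scale):
    """Cross filtering, frequency-domain step, and inverse FFT on two (C, H, W) features."""
    # 1) Cross filtering: each branch is gated by weights derived from the other branch.
    up_filtered = up * sigmoid(pointwise_conv(skip, w_skip))
    skip_filtered = skip * sigmoid(pointwise_conv(up, w_up))
    fused = up_filtered + skip_filtered            # additive fusion is an assumption

    # 2) Frequency domain: stack real and imaginary parts along the channel axis.
    spectrum = np.fft.fft2(fused, axes=(-2, -1))
    freq = np.concatenate([spectrum.real, spectrum.imag], axis=0)   # (2C, H, W)
    freq = freq * freq_scale                       # placeholder for the learned enhancement

    # 3) Split into real/imaginary parts and return to the spatial domain.
    real, imag = np.split(freq, 2, axis=0)
    restored = np.fft.ifft2(real + 1j * imag, axes=(-2, -1)).real
    return fused, restored


rng = np.random.default_rng(0)
channels, size = 16, 32
up = rng.standard_normal((channels, size, size))
skip = rng.standard_normal((channels, size, size))
w_up = rng.standard_normal((channels, channels)) * 0.1
w_skip = rng.standard_normal((channels, channels)) * 0.1

# With a scale of 1 the frequency step is the identity, so the round trip is lossless.
fused, restored = aff_sketch(up, skip, w_up, w_skip, freq_scale=1.0)
print(np.allclose(fused, restored))
```

The identity check confirms that splitting the spectrum into real and imaginary channels loses no information, so any change in the output comes from the enhancement step alone. A learned enhancement would replace `freq_scale` with trainable layers acting on the 2C-channel representation.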

4. Loss Functions and Accuracy Metrics

4.1. Loss Functions

In mining-area surface fissure semantic segmentation tasks, fissure regions serve as foreground targets, yet their pixel proportion is far lower than that of the background, resulting in pronounced class imbalance. To address this problem, this study adopts the Focal Tversky Loss as the loss function for ground fissure segmentation. Tversky Loss allows flexible control over the contributions of false positives (FP) and false negatives (FN) by adjusting the parameters $[eqn]$ and $[eqn]$, effectively assigning higher weights to hard-to-segment regions and encouraging the network to focus on these challenging areas. Furthermore, by incorporating the focal concept [45] into Tversky Loss [46], different weights are assigned to different classes, such that misclassified fissure pixels receive greater penalty. This design effectively alleviates the class imbalance problem and leads to improved segmentation performance. The corresponding confusion matrix is shown in Table 1.

[eqn]

where $[eqn]$, $[eqn]$, and $[eqn]$ denote the numbers of true positive, false positive, and false negative pixels, respectively. The parameters $[eqn]$ and $[eqn]$ are used to adjust the relative penalties assigned to $[eqn]$ and $[eqn]$, while $[eqn]$ is the focal modulation factor that enhances the model’s sensitivity to hard-to-classify samples. In this study, the parameters are set to $[eqn]$ = 0.7, $[eqn]$ = 0.3, and $[eqn]$ = 0.75, following the configuration adopted in MFPA-Net proposed by Jiang et al. [24] for automatic mining-area fissure extraction. This parameter setting has also been demonstrated to be effective in multiple studies, such as the application of this loss function in medical image segmentation by Abraham [46].
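A minimal NumPy sketch of a Focal Tversky loss with the stated values (0.7, 0.3, 0.75) is given below. Since the loss equation is not reproduced here, two points are assumptions: the weight 0.7 is attached to false negatives and 0.3 to false positives, and the value 0.75 is applied directly as the exponent. The parameter names `alpha`, `beta`, and `gamma` are labels chosen for this example.

```python
import numpy as np


def focal_tversky_loss(pred, target, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-7):
    """Focal Tversky loss for a binary (fissure vs. background) prediction.

    pred:   predicted fissure probabilities in [0, 1]
    target: ground-truth mask with 1 = fissure, 0 = background
    Assumptions of this sketch: alpha weights false negatives, beta weights
    false positives, and gamma is applied directly as the exponent.
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    tp = np.sum(pred * target)            # soft true positives
    fn = np.sum((1.0 - pred) * target)    # soft false negatives
    fp = np.sum(pred * (1.0 - target))    # soft false positives
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1.0 - tversky) ** gamma


target = np.array([1, 1, 1, 1, 0, 0, 0, 0])
perfect = focal_tversky_loss(target, target)
missed = focal_tversky_loss([1, 1, 0, 0, 0, 0, 0, 0], target)    # two false negatives
spurious = focal_tversky_loss([1, 1, 1, 1, 1, 1, 0, 0], target)  # two false positives
print(perfect, missed, spurious)
```

Under these assumptions, missing two fissure pixels costs more than adding two spurious ones, which matches the stated goal of penalizing misclassified fissure pixels more heavily. An exponent below 1 also raises the loss for predictions that are already fairly good, keeping the gradient from vanishing as the Tversky index approaches 1.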

4.2. Accuracy Metrics

To comprehensively evaluate the performance of the model in mining-area surface fissure semantic segmentation tasks, multiple evaluation metrics are employed to assess segmentation accuracy from different perspectives, including Intersection over Union (IoU), Precision, Recall, Dice coefficient, and Boundary F1 score.

IoU measures the degree of overlap between the predicted regions and the ground-truth annotations, directly reflecting the localization accuracy of fissure centerlines and boundaries. Precision indicates the correctness of the predicted results, representing the proportion of pixels predicted as fissures that are truly fissure pixels; a high Precision value suggests that the model can effectively distinguish fissures from pseudo-fissure interference. Recall focuses on the proportion of true fissure pixels that are successfully detected, with higher Recall indicating fewer missed detections and better preservation of fissure connectivity. The Dice coefficient, as the harmonic mean of Precision and Recall, balances the trade-off between false negatives and false positives, and is therefore widely regarded as an important overall performance metric for crack detection tasks involving minority classes and small targets. Boundary F1 is used to quantitatively evaluate the model’s accuracy in fissure boundary localization.
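The four region-based metrics can be sketched directly from pixel counts. The example below is a minimal illustration on binary masks; the function name and the toy masks are hypothetical, and Boundary F1 is omitted because it depends on a boundary-distance tolerance that this section does not specify.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Pixel-level Precision, Recall, Dice, and IoU for binary masks."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    tp = int(np.sum(pred & gt))       # fissure pixels correctly detected
    fp = int(np.sum(pred & ~gt))      # background predicted as fissure
    fn = int(np.sum(~pred & gt))      # fissure pixels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Both masks empty: treat as a perfect match.
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    return {"precision": precision, "recall": recall, "dice": dice, "iou": iou}

gt = np.array([[0, 1, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 1, 0]])
pred = np.array([[0, 1, 1, 1],
                 [0, 1, 0, 0],
                 [0, 0, 1, 0]])
metrics = segmentation_metrics(pred, gt)
```

In this toy case there are four true positives, one false positive, and one false negative, so Precision, Recall, and Dice are all 0.8 while IoU is 4/6.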

5. Experimental Results

5.1. Experimental Details

The deep learning framework was implemented using PyTorch 1.13.1 with Python 3.8. To ensure fairness in the comparative experiments, all segmentation models were trained on the same datasets with identical hyperparameter settings and input configurations. Specifically, all models were trained for 100 epochs using the Adam optimizer with a batch size of 8 and an initial learning rate of $[eqn]$. For the MFD and Crack500 datasets, the input resolution was set to $[eqn]$. Because the RFD dataset has a relatively uniform background and lower environmental noise, all models were trained and evaluated at an input resolution of $[eqn]$ on this dataset.

5.2. Mining-Area Surface Fissure Dataset Evaluation and Large-Scale Real-World Fissure Extraction Validation

In this study, MGF-UNet is compared with seven representative methods on the MFD dataset, including U-Net [47], DeepLabv3+ [48], SegFormer [49], Swin-UNet [50], PSPNet [51], SegNet [52], and the diffusion-based model SegDiff [53]. The qualitative fissure extraction results of all methods are presented in Figure 9, where enlarged views of selected fissure regions are provided on the right side to facilitate a visual comparison. Overall, all semantic segmentation models are able to produce reasonable segmentation results to some extent. However, the results obtained by MGF-UNet not only preserve the structural integrity of fissures but also achieve more accurate and refined boundary delineation. In contrast, Swin-UNet tends to generate coarse boundaries and introduces spurious noise outside fissure regions when attempting to recover edge details, while the diffusion-based SegDiff model demonstrates relatively strong capability in extracting overall fissure structures.

Quantitative evaluation results on the MFD dataset are summarized in Table 2. DeepLabv3+ achieves the highest Recall of 84.9%, outperforming the other methods, while SegDiff attains the highest Precision of 80.1%. The proposed MGF-UNet achieves a Precision of 78.2%, a Dice score of 81.4%, an IoU of 68.6%, and a Boundary F1 score of 86.6%. These results collectively indicate that MGF-UNet delivers superior overall segmentation quality compared with the other classical segmentation networks.
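When Dice and IoU are computed from the same pooled pixel counts, they are linked by $\mathrm{IoU} = \mathrm{Dice} / (2 - \mathrm{Dice})$, which follows from $\mathrm{Dice} = 2TP/(2TP + FP + FN)$ and $\mathrm{IoU} = TP/(TP + FP + FN)$. The short check below applies this identity to the Dice scores reported for MGF-UNet in Sections 5.2 and 5.3. It assumes pooled counts (per-image averaging would break the identity), and the helper name is hypothetical.

```python
def iou_from_dice(dice):
    """Convert a Dice score in [0, 1] to the equivalent IoU."""
    return dice / (2.0 - dice)

# (Dice, IoU) pairs reported for MGF-UNet, expressed as fractions.
reported = {
    "MFD": (0.814, 0.686),
    "RFD": (0.894, 0.808),
    "Crack500": (0.763, 0.617),
}

# IoU implied by each reported Dice score, in percent with one decimal.
implied = {name: round(100 * iou_from_dice(dice), 1)
           for name, (dice, _) in reported.items()}
```

Under that assumption, the implied IoU values agree with the reported ones to the first decimal place on all three datasets.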

To comprehensively validate the fissure extraction performance of MGF-UNet in real-world large-scale scenarios, five sample regions containing surface fissures were selected within the mining area, as shown in Figure 10. These test regions are predominantly affected by vegetation coverage and thus effectively reflect the model’s adaptability under complex real-world conditions. The pretrained weights obtained by training MGF-UNet on the MFD dataset were directly applied to extract fissures in these regions, and the corresponding results are presented in Figure 11. From the visual inspection, MGF-UNet is able to successfully extract fissures in complex environments, demonstrating a certain degree of robustness against surface-related interference. However, the experimental results also reveal inherent limitations of the model. Owing to the similarity between fissures and vegetation shadows in terms of gray-scale and texture characteristics, some shadow regions are misclassified as fissures in the predictions for Site 1, Site 3, and Site 5.

The comparative results of surface crack extraction by MGF-UNet and the baseline models at Site 5 are illustrated in Figure 12. MGF-UNet demonstrates strong suppression of non-crack noise (e.g., vegetation shadows) while maintaining high accuracy in identifying genuine cracks. Nevertheless, false detections still occur in areas with prominent topographic relief. This is primarily because terrain-induced shadows and actual cracks share highly similar spatial morphology and gray-scale distribution, a factor widely recognized as a major bottleneck in automated surface crack extraction [54,55].

5.3. Road Crack Dataset Identification

5.3.1. Comparative Experiments on RFD

To evaluate the reliability and applicability of MGF-UNet, comparative experiments were further conducted on the self-constructed road fissure dataset (RFD) and the public pavement dataset Crack500 [56]. The RFD dataset was created from high-resolution images captured using mobile devices. After cropping the original images and discarding samples without fissures, a total of 2086 fissure images with a resolution of $[eqn]$ pixels were retained and split into training and test sets at a ratio of 5:1. Figure 13 presents a visual comparison of fissure extraction results on the RFD dataset between MGF-UNet and other methods. Enlarged views of selected regions (highlighted by red boxes) are shown on the right, where it can be observed that MGF-UNet produces more accurate segmentation results with finer and clearer fissure boundaries compared with the competing approaches.

The quantitative evaluation results are summarized in Table 3. Although the Recall of MGF-UNet is slightly lower than that of DeepLabv3+, it achieves a Precision of 84.1%, a Dice score of 89.4%, an IoU of 80.8%, and a Boundary F1 (BF1) score of 85.1%. Notably, on the RFD dataset, the diffusion-based model SegDiff does not outperform the proposed method. The overall performance of MGF-UNet on RFD is consistent with its behavior on the MFD dataset, while both Precision and Recall are higher than those obtained on MFD. This improvement can be attributed to the relatively uniform gray background and reduced noise interference in the RFD dataset. Overall, the results demonstrate that MGF-UNet exhibits strong capability in extracting fine fissures and accurately delineating fissure boundaries.

5.3.2. Comparative Experiments on the Crack500 Dataset

The Crack500 dataset [56] is a concrete pavement crack dataset collected by Temple University using mobile devices. It consists of 1896 training images and 1124 test images, with an original image resolution of 640 × 360 pixels. The images are resized to 256 × 256 pixels and then divided into training and validation sets. Figure 14 presents a visual comparison of fissure extraction results on the Crack500 dataset between MGF-UNet and other methods. As shown in the results, MGF-UNet demonstrates more accurate segmentation performance on complex and blurred cracks compared with the competing networks. In particular, for the fine-scale cracks highlighted in the red boxes, MGF-UNet produces segmentation results that are closer to the ground truth, indicating its superior sensitivity to subtle fissure structures and stronger robustness against noise.

The quantitative evaluation results are summarized in Table 4. MGF-UNet achieves the highest Precision of 69.7%. Although its Recall is slightly lower than that of DeepLabv3+, MGF-UNet outperforms all competing models in terms of Dice score and IoU, reaching 76.3% and 61.7%, respectively, with a Boundary F1 (BF1) score of 73.2%. As Crack500 is a public dataset with substantially more complex background conditions than the RFD dataset, all methods perform worse on it than on RFD. Nevertheless, MGF-UNet consistently delivers superior overall performance compared with the other segmentation approaches, particularly in the extraction of the fine fissures highlighted in the red-boxed regions on the right side of the figures.

5.4. Model Complexity Comparison

Table 5 reports four metrics used to evaluate the computational efficiency of different network models on an NVIDIA GeForce RTX 4060 Ti: the number of parameters, average inference latency, floating-point operations (FLOPs), and frames per second (FPS). The parameter count, measured in millions (M), reflects model complexity; average latency denotes the mean time required to complete a single forward inference; FLOPs are a classical indicator of computational complexity; and FPS represents runtime efficiency. As shown in the table, DeepLabv3+ with a MobileNetV2 backbone has the smallest number of parameters, while SegFormer with a MiT-B1 encoder requires only 7.106 ms for a single forward pass, achieving an average throughput of 140.72 FPS and the lowest FLOPs (3.454 G). SegNet exhibits the highest runtime efficiency. By contrast, MGF-UNet, while delivering superior prediction performance, maintains a parameter count comparable to Swin-UNet, with model complexity second only to SegFormer and runtime efficiency at a medium-to-high level. Specifically, MGF-UNet has 22.27 M parameters, an average latency of 8.752 ms, an FPS of 114.26, and FLOPs of 4.002 G; qualitative efficiency comparisons are illustrated in Figure 15. The diffusion-based model SegDiff employs 100 diffusion steps for iterative denoising during inference, resulting in a substantially longer inference time per image than the other segmentation models. Consequently, SegDiff exhibits clear limitations in scenarios with stringent real-time requirements and is difficult to deploy in practical applications.
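The reported latency and FPS figures are consistent with the simple relation FPS ≈ 1000 / latency (ms), which holds when one image is processed per forward pass. The sketch below shows one way such numbers could be measured and converted. The benchmarking protocol (warm-up count, number of repeats, batch size of one) and both function names are assumptions, since this section does not describe how the timings were taken.

```python
import time

def measure_latency_ms(run_once, warmup=5, repeats=50):
    """Average wall-clock time of run_once() in milliseconds."""
    for _ in range(warmup):           # warm-up runs are excluded from timing
        run_once()
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / repeats

def fps_from_latency(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms

# Stand-in workload. A real benchmark would call the model's forward pass
# and synchronize the GPU before reading the clock.
latency = measure_latency_ms(lambda: sum(i * i for i in range(1000)))

fps_mgf_unet = fps_from_latency(8.752)    # latency reported for MGF-UNet
fps_segformer = fps_from_latency(7.106)   # latency reported for SegFormer
```

Applied to the reported latencies, the conversion gives values within rounding of the 114.26 FPS and 140.72 FPS listed for MGF-UNet and SegFormer.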

5.5. Ablation Experiments

To evaluate the effectiveness of the key components in MGF-UNet, a series of ablation experiments were conducted on the MFD dataset, as summarized in Table 6. U-Net was adopted as the baseline, and the MFS, EMA, TSCT, FiLM, and AFF modules were progressively incorporated from Experiment 2 to Experiment 6. All models were trained independently under identical training configurations to ensure a fair comparison. Except for minor fluctuations observed in Precision and Recall, the remaining metrics exhibit a consistent upward trend. Compared with the baseline model, Precision, Recall, Dice, IoU, and Boundary F1 are improved by 4.8%, 0.5%, 3.2%, 4.4%, and 1.9%, respectively.

5.6. Generalization Study on Non-Crack Data

To further evaluate how well MGF-UNet generalizes beyond crack data, we conducted experiments on the ISPRS Vaihingen [57] semantic segmentation dataset and compared it against two general-purpose segmentation models, DeepLabv3+ and SegFormer. ISPRS Vaihingen is a widely used remote sensing benchmark comprising 33 images of varying sizes with six classes: impervious surfaces, buildings, low vegetation, trees, cars, and background. The images were tiled to $[eqn]$ pixels, yielding 1746 training tiles and 508 test tiles. Figure 16 presents qualitative comparisons. Overall, MGF-UNet achieves segmentation accuracy comparable to DeepLabv3+, with mean IoU values of 76.7% and 76.9%, respectively. Pixel accuracy reaches 83.5% for MGF-UNet, slightly below the 85.9% recorded by DeepLabv3+.
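For a multi-class benchmark such as Vaihingen, mean IoU and pixel accuracy are typically derived from a class confusion matrix. The following is a minimal sketch of that computation on a toy three-class example. The function names are hypothetical, and whether the paper averages over all six classes or excludes the background class is not stated, so the averaging rule here is an assumption.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    gt = np.asarray(gt, dtype=np.int64).ravel()
    pred = np.asarray(pred, dtype=np.int64).ravel()
    idx = gt * num_classes + pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def miou_and_pixel_accuracy(cm):
    """Mean IoU over classes that occur, plus overall pixel accuracy."""
    tp = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0                 # skip classes absent from both masks
    miou = float(np.mean(tp[valid] / union[valid]))
    pixel_acc = float(tp.sum() / cm.sum())
    return miou, pixel_acc

gt = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(gt, pred, num_classes=3)
miou, pixel_acc = miou_and_pixel_accuracy(cm)
```

In the toy example the per-class IoU values are 1/3, 2/3, and 1/2, giving a mean IoU of 0.5, while four of the six pixels are correct, giving a pixel accuracy of 2/3. The two metrics can therefore rank models differently, as with the small gaps reported above.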

6. Discussion

6.1. General Interpretation

In UAV–visible imagery, mining-area surface fissures are highly susceptible to interference from vegetation occlusion, strong shadows, and visually similar linear objects (e.g., farmland boundaries), which often results in blurred boundaries, disrupted fissure connectivity, and reduced extraction accuracy. To comprehensively evaluate segmentation performance under such challenging conditions, comparative experiments using different segmentation methods were conducted on the mining-area fissure dataset (MFD) as well as two road crack datasets (RFD and Crack500).

From the experimental results, it can be observed that different models exhibit markedly different behaviors across datasets. The diffusion-based model SegDiff achieves higher segmentation accuracy on mining-area fissures than on the two road crack datasets. This indicates that SegDiff is more effective at extracting structurally simple fissures than those with complex, interlaced distributions and is less sensitive to noise. This advantage can be attributed to its generative modeling paradigm based on the diffusion process, which learns the overall morphological distribution of fissures through iterative denoising. In contrast, MGF-UNet consistently maintains high fissure connectivity across all datasets while simultaneously recovering finer boundary details, demonstrating more stable and robust performance under varying scene conditions.

From an architectural perspective, each key component of MGF-UNet plays a crucial role in enhancing fissure segmentation performance. The MFS module strengthens the representation of fissure boundaries through multi-scale perception. EMA captures spatial dependencies along horizontal and vertical directions via feature grouping and cross-spatial interaction, thereby reinforcing the recognition of strip-like fissure patterns. TSCT improves the structural integrity of fissures by preserving informative tokens and modeling long-range dependencies. FiLM injects shallow-layer local details into deep semantic representations through feature modulation, enabling efficient local–global feature fusion. In addition, AFF performs adaptive frequency-domain feature fusion based on Fourier transforms, enhancing high-frequency cues such as edge textures and improving the perception of small targets and elongated structures. Through the collaborative effects of these modules, MGF-UNet achieves accurate fissure extraction even under strong vegetation interference.

6.2. Limitations and Future Work

The monitoring paradigm developed in this study, which synergizes UAV-borne remote sensing with the MGF-UNet framework, demonstrates substantial practical utility for the high-fidelity characterization of surface hazards. Primarily, the centimeter-level spatial resolution provided by UAV-borne sensors overcomes the detection limitations of traditional remote sensing imagery, enabling the proposed model to capture fine-grained fissure characteristics and providing critical data support for early surface hazard warning. Furthermore, the high operational flexibility and cost-effectiveness of UAVs allow for on-demand deployment without the constraints of fixed orbital cycles or complex terrains. These advantages have established UAV-based observations as a primary tool for rapid post-event surveys following the occurrence of surface hazards.

In the context of existing UAV-based geological investigations, Darmawan et al. [10] performed morphological analysis on volcanic fissures using high-resolution DEMs and orthomosaics; however, their approach is constrained by laborious procedures and extended monitoring cycles. For landslide hazards, Cheng et al. [8] leveraged deep transfer learning to integrate landslide boundaries with surface crack features for identification. Nevertheless, capturing the intricate topological characteristics of fine-grained fissure edges remains a technical bottleneck for their segmentation framework. Additionally, Cirillo et al. [9] investigated active fault fissures through soil profile sampling, yet the prohibitive labor costs and logistical constraints limit its applicability for rapid, large-scale fissure extraction in deformation-affected areas. In contrast, the integration of UAV remote sensing with our MGF-UNet semantic segmentation framework achieves pixel-level precision and superior cost-effectiveness across diverse surface backgrounds, offering a more versatile and automated solution for the fine-grained characterization of surface hazards.

Beyond the integration of multi-source remote sensing, future research should prioritize a multidisciplinary framework that synergizes geodetic observations with geophysical surveys. By incorporating techniques such as Electrical Resistivity Tomography (ERT), Ground Penetrating Radar (GPR), or seismic methods, surface fissure morphologies can be cross-referenced with subsurface resistivity anomalies or dielectric discontinuities. This synergy between surficial and deep-seated datasets facilitates a comprehensive 3D characterization of fissure development mechanisms and their evolutionary trajectories. Such an integrated monitoring strategy is pivotal for refining risk mitigation protocols and advancing the predictive accuracy of early-warning systems in complex geological environments.

7. Conclusions

This study addresses the potential hazards associated with mining-area ground fissures and proposes an encoder–decoder-based semantic segmentation network, termed MGF-UNet. The network integrates key designs such as a Token-Selective Context Transformer (TSCT) and a multi-scale feature sensing module (MFS) to enhance automatic fissure extraction under complex coal mining surface environments. Comparative experiments conducted on UAV-acquired mining-area fissure datasets against multiple mainstream semantic segmentation methods demonstrate that MGF-UNet achieves significant advantages in both overall fissure extraction performance and fine-grained boundary delineation. Furthermore, additional experiments on road crack datasets, including RFD and Crack500, further verify the effectiveness of MGF-UNet for fissure extraction tasks and its stability across different scene types. In large-scale real-world applications, MGF-UNet maintains high extraction accuracy and computational efficiency, highlighting its strong potential for practical engineering applications. Experimental results also confirm that modules such as MFS and TSCT effectively improve the network’s fissure extraction capability.

Although this study has made progress in extracting surface fissures under complex environmental conditions, it is still constrained by data-related limitations. For instance, interference from surface vegetation coverage leads to insufficient identification of concealed fissures. Future work will explore the fusion of multi-source remote sensing data, such as optical and thermal infrared imagery, in combination with multi-temporal change detection methods to further enhance the robustness of the model in complex environments and improve its ability to characterize fissure evolution patterns.

Bibliography

1. Fu Y., Wu Y., Yin X., Zhang Y. Mapping mining-induced ground fissures and their evolution using UAV photogrammetry. Front. Earth Sci. 2023, 11, 1260913. doi:10.3389/feart.2023.1260913
2. Liu Y., Zhang D., Wang G., Liu C., Zhang Y. Discrete element method-based prediction of areas prone to buried hill-controlled earth fissures. J. Zhejiang Univ. Sci. A 2019, 20, 794–803. doi:10.1631/jzus.A1900292
3. Wang K., Wei B., Zhao T., Wu G., Zhang J., Zhu L., Wang L. An automated approach for mapping mining-induced fissures using CNNs and UAS photogrammetry. Remote Sens. 2024, 16, 2090. doi:10.3390/rs16122090
4. Xu D., Zhao Y., Jiang Y., Zhang C., Sun B., He X. Using improved edge detection method to detect mining-induced ground fissures identified by unmanned aerial vehicle remote sensing. Remote Sens. 2021, 13, 3652. doi:10.3390/rs13183652
5. Zou S., Xiong F., Luo H., Lu J., Qian Y. AF-Net: All-scale feature fusion network for road extraction from remote sensing images. Proceedings of the Digital Image Computing: Techniques and Applications (DICTA), Gold Coast, Australia, 29 November–1 December 2021, pp. 1–8. doi:10.1109/DICTA52665.2021.9647235
6. Montagnon T., Hollingsworth J., Pathier E., Marchandon M., Dalla Mura M., Giffard-Roisin S. Sub-pixel optical satellite image registration for ground deformation using deep learning. Proceedings of the IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022, pp. 2716–2720. doi:10.1109/ICIP46576.2022.9897214
7. Zhang Z., Zhu L. A review on unmanned aerial vehicle remote sensing: Platforms, sensors, data processing methods, and applications. Drones 2023, 7, 398. doi:10.3390/drones7060398
8. Cheng Z., Gong W., Jaboyedoff M., Chen J., Derron M.-H., Zhao F. Landslide Identification in UAV Images Through Recognition of Landslide Boundaries and Ground Surface Cracks. Remote Sens. 2025, 17, 1900. doi:10.3390/rs17111900