A Cross-Layer Feature Fusion Framework with Hierarchical Interaction for Remote Sensing Change Detection
Xin Meng, Chuanbiao Qiu, Chong Liu, Yanli Xu

TL;DR
This paper introduces a new framework for detecting changes in high-resolution satellite images by improving how different layers of image features work together.
Contribution
The novel CLFF framework with MP-Block enhances cross-layer feature fusion for more accurate remote sensing change detection.
Findings
CLFF outperforms baseline models on four benchmark datasets with performance improvements in IoU ranging from 1.35% to 4.85%.
The MP-Block improves feature interaction and suppresses false changes caused by illumination and background clutter.
The lightweight attention module enhances spatial responses and highlights key change-related information.
Abstract
The rapid progress of remote sensing (RS) and computer vision has greatly advanced change detection (CD), and hybrid architectures combining Transformers and convolutional neural networks (CNNs) have shown strong potential in recent years. Nevertheless, reliable CD for very high-resolution (VHR) imagery remains challenging due to large appearance variations across acquisition times, complex background clutter, and target structural diversity. These factors often hinder the modeling of fine edge textures, the maintenance of feature continuity, and the suppression of false changes caused by illumination fluctuations. To address these issues, this paper proposes a Cross-layer Feature Fusion Framework (CLFF) that achieves more accurate and stable change detection by explicitly enhancing the collaborative fusion capability of multi-layer features. The core component of this framework is the…
Funding
- National Natural Science Foundation of China
- Key Program of the Joint Funds of the National Natural Science Foundation of China
- Innovation Program of Shanghai Municipal Education Commission
- Key Technology R&D Plan of the Science and Technology Commission of Shanghai Municipality
Taxonomy
Topics: Remote-Sensing Image Classification · Remote Sensing in Agriculture · Advanced Neural Network Applications
1. Introduction
Driven by both natural processes and anthropogenic activities, the Earth’s surface has exhibited increasingly dynamic characteristics in recent years, making time-series analysis of RS data an indispensable component of modern Earth observation systems. In parallel, RS interpretation techniques have rapidly advanced, ranging from polarimetric scattering analysis [1,2,3] to high-level semantic extraction from optical imagery. Among these developments, CD has emerged as a fundamental task that identifies surface transitions through the comparison of multitemporal images [4]. With the growing availability of VHR imagery, CD has been widely applied in urban planning [5], land-use assessment [6], and disaster evaluation [7].
VHR remote sensing imagery provides abundant spatial and structural details for change analysis. In practice, however, effectively exploiting such information remains challenging. Complex scenes often exhibit diverse appearance variations that are unrelated to actual changes, leading to significant visual ambiguity and making it difficult to distinguish genuine structural modifications from superficial surface differences. As illustrated in Figure 1, common sources of such appearance-induced pseudo-changes include cast shadows, seasonal vegetation growth, transient objects (e.g., vehicles or ships), and surface discolorations. These factors frequently trigger false alarms and compromise boundary integrity in change detection results. While such appearance variations are often prominent in shallow feature layers, they lack consistent semantic meaning. This discrepancy highlights the necessity of explicitly coordinating semantic and structural information across feature hierarchies. In addition to the inherent ambiguity of VHR imagery, the large data volume and hardware constraints further necessitate careful trade-offs between detection accuracy and computational efficiency in real-world systems [8].
Early works on remote sensing change detection were primarily conducted at the pixel level due to computational efficiency constraints [9]. In this context, transformation- and multi-resolution-based analyses—such as PCA [10], wavelet transforms [11], and CVA [12]—were widely adopted. Pixel-level classifiers, such as SVM [13] and RF [14], have also been employed, often in combination with handcrafted texture features such as GLCM [15]. While effective in homogeneous scenes, these methods depend on manual feature engineering and lack robustness in complex urban settings. From a task formulation perspective, CD differs fundamentally from related remote sensing tasks such as object detection and hyperspectral unmixing. While object detection focuses on localizing targets in single static images [16] and hyperspectral unmixing aims to separate mixed spectral signatures [17], CD requires explicit bi-temporal modeling to distinguish genuine semantic changes from variations caused solely by appearance differences.
To alleviate these limitations, Transformer-based architectures have recently been introduced into remote sensing change detection to enhance global context modeling. Through the self-attention mechanism, Transformers are able to capture long-range dependencies and explicitly model global structural relationships between bi-temporal images. Representative approaches, such as the Bi-temporal Image Transformer (BIT) [18], hierarchical Transformer-based Siamese networks [19], and hybrid CNN–Transformer frameworks with multi-scale token aggregation [20], demonstrate the advantages of integrating local texture modeling with global context perception. Nevertheless, Transformer-based methods generally incur high computational and memory costs, and patch-wise tokenization may compromise the preservation of fine-grained spatial details, thereby limiting their applicability to very high-resolution imagery and resource-constrained scenarios.
Beyond global context modeling, multi-scale feature fusion has been widely adopted to integrate fine-grained spatial details from shallow layers with high-level semantic information from deep layers, thereby improving robustness against complex background interference. Representative studies enhance change perception through cascading multi-scale features with difference enhancement [21], injecting shallow details into deep semantic representations for boundary refinement [22], or explicitly modeling bitemporal spatial relationships to alleviate misalignment-induced inconsistencies [23]. More recently, hierarchical and interaction-aware fusion strategies have attracted increasing attention, including the lightweight interlayer correlation enhancement design proposed by Xiao et al. [24], as well as representative CNN-based fusion frameworks such as IFNet [25] and CGNet [26]. However, many existing fusion methods rely on fixed or loosely coordinated aggregation mechanisms, which may amplify redundant responses and compromise the boundary integrity of fused feature representations, particularly when processing fine-grained or thin-structure changes.
Motivated by the above analysis, we identify that a key bottleneck in VHR change detection lies in aligning high-level semantics with low-level structural details under complex backgrounds. Although prior CNN/Transformer and fusion-based methods improve local textures, global context, or multi-scale aggregation, the interaction between deep and shallow features is often insufficiently coordinated, leading to redundant activations and blurred or fragmented boundaries for fine-grained changes. To this end, we propose the CLFF framework, which explicitly organizes cross-layer feature interaction and couples it with coordinated semantic–spatial refinement to mitigate the semantic–structural mismatch and enhance multi-scale change representation.
The main contributions of this work are summarized as follows:
- We propose a novel CLFF that explicitly organizes cross-layer interaction and hierarchical refinement to address the semantic–structural mismatch between deep and shallow features in remote sensing change detection.
- We design an MP-Block to progressively integrate hierarchical features and facilitate effective information flow across adjacent semantic levels.
- We develop an MIFM as the core fusion backbone, which is composed of the RFRB and ACGF units to jointly perform response-aware feature refinement and adaptive channel-wise recalibration.
- Extensive experiments on multiple benchmark datasets demonstrate the effectiveness and robustness of the proposed method under diverse and challenging scenarios.
2. Materials and Methods
2.1. Overall Framework
Remote sensing change detection aims to accurately identify pixel-level changes between bi-temporal images and can be regarded as a specialized form of semantic segmentation. However, complex backgrounds, small object scales, and appearance variations caused by illumination and seasonal factors often lead to unstable change representations, making reliable change identification challenging. To address these challenges, we design CLFF to enable explicit interactions across different semantic levels, as illustrated in Figure 2.
The CLFF framework consists of four components: a Siamese VGG16 encoder, a bitemporal feature alignment module, an MP-Block with multi-level interaction perception, and a lightweight decoder. Together, these modules facilitate the extraction and fusion of change-relevant information across multiple depths and spatial scales, thereby improving representation stability and detection accuracy.
Specifically, CLFF adopts a Siamese VGG16 encoder with shared weights to extract hierarchical feature representations from pre- and post-change images. Features from multiple stages are retained to preserve fine-grained spatial details at shallow layers while progressively encoding higher-level semantic information at deeper layers. However, slight misregistration and geometric inconsistencies between bi-temporal inputs may still exist in practice. To alleviate this issue, a lightweight BFAB is introduced prior to feature fusion to enhance spatial consistency between corresponding features, as illustrated in Figure 2b.
The aligned multi-level features are then fed into the proposed MP-Block for progressive cross-layer fusion and coordinated refinement. Within the MP-Block, adjacent semantic features explicitly interact and are enhanced to achieve joint modeling of semantic consistency and structural continuity. Finally, the refined features are processed by a lightweight decoder through convolutional operations and bilinear upsampling to recover spatial resolution and generate the pixel-wise change map. During decoding, high-level features are progressively upsampled and fused into finer-scale representations in a top-down manner, yielding the final fine-scale fused feature for change map prediction.
In summary, CLFF uses a Siamese encoder with shared weights to obtain hierarchical feature representations of the pre- and post-change images, ensuring that both inputs are processed by an identical feature extractor. In our implementation, VGG16 serves as the backbone of the feature extraction network in order to capture multi-scale spatial and semantic information at different stages. Specifically, the encoder generates a set of feature maps with successively smaller spatial resolutions but greater semantic abstraction: the shallow maps retain abundant structural information, whereas the deep maps carry higher-level semantic cues. These multi-level features are then fed into the subsequent bi-temporal feature alignment and cross-layer fusion modules, serving as the basis for cross-layer interaction and cooperative refinement.
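To make the feature hierarchy concrete, the following sketch lists the per-stage output shapes of a VGG16 encoder for a 256 × 256 input patch. It assumes that one feature map is tapped at the end of each of the five convolutional stages, before the pooling layer that follows it; which stages CLFF actually retains is not stated in this section, and the function name `vgg16_stage_shapes` is purely illustrative.

```python
# Channel widths of the five convolutional stages in the standard VGG16 layout.
VGG16_STAGE_CHANNELS = (64, 128, 256, 512, 512)


def vgg16_stage_shapes(height, width):
    """Return (channels, height, width) for each stage output.

    Assumption: features are tapped before each 2x2 max-pooling layer, so
    stage k (1-indexed) has a stride of 2**(k-1) relative to the input.
    """
    shapes = []
    for stage_index, channels in enumerate(VGG16_STAGE_CHANNELS):
        stride = 2 ** stage_index
        shapes.append((channels, height // stride, width // stride))
    return shapes


# Both temporal images pass through the same encoder (shared weights),
# so the pre- and post-change feature pyramids have identical shapes.
shapes_t1 = vgg16_stage_shapes(256, 256)
shapes_t2 = vgg16_stage_shapes(256, 256)
for level, shape in enumerate(shapes_t1, start=1):
    print(f"stage {level}: {shape}")
```

Under these assumptions, the shallowest map keeps the full input resolution with few channels, while the deepest map is 16 times smaller per side, which is the resolution gap that the cross-layer fusion has to bridge.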
2.2. Multi-Level Interaction Perception Block (MP-Block)
The MP-Block serves as a core building block of the proposed CLFF framework and is designed to organize multi-level feature interaction and fusion within a unified architecture. Unlike traditional designs that treat feature fusion and refinement as separate processes, the MP-Block introduces a hierarchical architecture to jointly account for semantic consistency and structural continuity. Within this hierarchy, the MIFM acts as the primary engine for cross-layer interaction, while the Position-Aware Module (PAM) [27] provides complementary global enhancement.
As shown in Figure 2c, the MIFM constitutes the main fusion backbone of the MP-Block. Explicit interaction paths are set at various levels so that information is gradually conveyed from high-level semantic representations to low-level structural features. This design preserves semantic distinctiveness and spatial alignment. Building upon this foundation, the PAM [27] is incorporated as an auxiliary refinement component. Rather than functioning as an independent feature extractor, the PAM applies global spatial re-weighting after inter-layer feature fusion, enabling the network to suppress appearance-induced pseudo-changes (e.g., illumination variations) and further enhance object boundary delineation.
With the overall framework of the MP-Block defined, we next elaborate on the design of its core fusion architecture, i.e., the MIFM.
2.2.1. Multi-Branch Inter-Layer Fusion Module (MIFM)
The MP-Block is centered around the MIFM, which performs stepwise fusion of aligned features across adjacent semantic levels. At each fusion stage, the MIFM takes a high-level feature and its adjacent low-level feature as inputs and projects them into a unified feature space to obtain an initial fused representation, which serves as the basis for subsequent refinement.
On this basis, the MIFM is internally equipped with two complementary refinement branches, namely the RFRB and the ACGF unit. The RFRB is designed to enhance structural details and local contextual information, whereas the ACGF unit focuses on adaptively recalibrating channel-wise responses to suppress redundant information and emphasize change-relevant features.
The outputs of these two internal branches are subsequently aggregated to generate a refined fused feature at the current stage, which is then propagated to subsequent fusion stages and the final decoder. The detailed designs of the RFRB and the ACGF unit are presented in the following subsections.
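The stepwise fusion described above can be sketched as a simple top-down loop. This is a minimal illustration, not the authors' implementation: it assumes that the fused output of one stage becomes the high-level input of the next stage, uses four hypothetical levels named `F1` to `F4`, and replaces the actual fusion operation with a placeholder `fuse` callable that only records the order of operations.

```python
def progressive_fusion(features, fuse):
    """Fuse a shallow-to-deep list of features in a top-down pass.

    `features[0]` is the shallowest level and `features[-1]` the deepest.
    `fuse(high, low)` is a placeholder for one fusion stage.
    Returns the fused outputs, ordered from the deepest pair to the shallowest.
    """
    if not features:
        raise ValueError("at least one feature level is required")
    current = features[-1]
    outputs = []
    for low in reversed(features[:-1]):
        # Each stage pairs the running fused feature with the adjacent lower level.
        current = fuse(current, low)
        outputs.append(current)
    return outputs


def trace_fuse(high, low):
    """Stand-in fusion that records which inputs were combined."""
    return f"fuse({high}, {low})"


stages = progressive_fusion(["F1", "F2", "F3", "F4"], trace_fuse)
for stage in stages:
    print(stage)
```

The last entry of `stages` corresponds to the finest-scale fused feature, which is the one a decoder would consume for prediction.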
2.2.2. Adaptive Channel-Group Fusion (ACGF) Unit
To alleviate semantic inconsistency and feature redundancy during inter-layer feature fusion, we employ the ACGF unit as a key component of the proposed MIFM. As illustrated in Figure 3, ACGF refines adjacent multi-level features by jointly enhancing spatial context and channel-wise responses in a lightweight and structured manner.
Given two adjacent feature maps $F_h$ (higher level) and $F_l$ (lower level) with different spatial resolutions, $F_h$ is first upsampled to match the resolution of $F_l$, and both features are then projected into a unified channel space through convolutions and summed element-wise to obtain a base feature representation,

$$F_b = \mathrm{Conv}\big(\mathrm{Up}(F_h)\big) + \mathrm{Conv}(F_l),$$

where $\mathrm{Up}(\cdot)$ denotes bilinear upsampling.
Based on $F_b$, ACGF adopts two parallel refinement branches with complementary roles. As shown in Figure 3, the upper branch focuses on spatial refinement. A GSG mechanism is applied within channel groups to model localized spatial dependencies and emphasize informative responses. Specifically, the base feature is evenly divided along the channel dimension into $G$ groups, and the $g$-th group feature is denoted as $F_b^{g}$. For each group, a convolution is applied to obtain the interaction response:

$$R_g = \mathrm{Conv}\big(F_b^{g}\big).$$

The response is then reshaped and normalized to generate a softmax-based gating weight map:

$$W_g = \frac{\exp\big(\mathrm{vec}(R_g)\big)}{\sum_{i}\exp\big(\mathrm{vec}(R_g)_i\big) + \epsilon},$$

where $\mathrm{vec}(\cdot)$ denotes vectorization and $\epsilon$ is a small constant for numerical stability.
The gated feature is obtained by element-wise multiplication:

$$\tilde{F}_b^{g} = F_b^{g} \odot W_g.$$

The outputs from all groups are concatenated and combined with a residual connection from $F_b$, followed by normalization, resulting in the spatially refined feature $F_s$.
The lower branch performs channel refinement. Specifically, following global average pooling on $F_b$, an efficient channel attention module based on one-dimensional convolution, similar to ECA-Net [28], is employed to capture local cross-channel interactions and adaptively reweight channel responses, producing the channel-refined feature $F_c$.
Finally, the outputs of the spatial and channel refinement branches are fused via element-wise addition to obtain the output feature of ACGF:

$$F_{\mathrm{ACGF}} = F_s + F_c.$$
With this design, ACGF balances spatial discrimination and channel selectivity, enabling effective and robust inter-layer feature fusion.
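The two ACGF branches can be illustrated with a small dependency-free sketch. It is a simplified stand-in rather than the published module: the learned per-group convolution is replaced by a channel mean, the one-dimensional channel convolution uses a fixed hypothetical kernel, and the normalization after the residual connection is omitted. Features are stored as lists of channels, each channel being a flat list of spatial values, and all function names are illustrative.

```python
import math


def softmax(values, eps=1e-8):
    """Numerically stable softmax over a flat list; eps guards the denominator."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps) + eps
    return [e / total for e in exps]


def grouped_spatial_gating(feature, num_groups):
    """Spatial branch: per-group softmax gating plus a residual connection."""
    if len(feature) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    group_size = len(feature) // num_groups
    gated = []
    for g in range(num_groups):
        group = feature[g * group_size:(g + 1) * group_size]
        # Stand-in for the learned per-group convolution: mean over channels.
        response = [sum(column) / len(group) for column in zip(*group)]
        weights = softmax(response)  # one weight per spatial position
        gated.extend([[v * w for v, w in zip(ch, weights)] for ch in group])
    # Residual connection from the base feature (normalization omitted here).
    return [[a + b for a, b in zip(cg, cb)] for cg, cb in zip(gated, feature)]


def channel_attention(feature, kernel=(0.25, 0.5, 0.25)):
    """Channel branch: global average pooling, 1-D convolution, sigmoid, rescale."""
    pooled = [sum(ch) / len(ch) for ch in feature]
    half = len(kernel) // 2
    refined = []
    for c, channel in enumerate(feature):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = c + k - half
            if 0 <= idx < len(pooled):  # zero padding at the channel borders
                acc += weight * pooled[idx]
        gate = 1.0 / (1.0 + math.exp(-acc))
        refined.append([v * gate for v in channel])
    return refined


def acgf_like(feature, num_groups=2):
    """Sum the spatially refined and channel-refined features element-wise."""
    spatial = grouped_spatial_gating(feature, num_groups)
    channel = channel_attention(feature)
    return [[s + c for s, c in zip(cs, cc)] for cs, cc in zip(spatial, channel)]


# Toy base feature with 4 channels and 4 spatial positions.
base = [[0.1, 0.4, 0.2, 0.3], [0.5, 0.1, 0.0, 0.2],
        [0.3, 0.3, 0.9, 0.1], [0.2, 0.6, 0.4, 0.8]]
out = acgf_like(base, num_groups=2)
print(len(out), len(out[0]))
```

The output keeps the shape of the base feature, so a block of this kind can be inserted between fusion stages without changing the surrounding tensor layout.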
2.2.3. Response-Aware Feature Refinement Block (RFRB)
The RFRB is introduced as a complementary branch of the MIFM to enhance fine-grained change representations and improve the robustness of inter-layer feature fusion. Instead of explicitly reconstructing features, RFRB performs response-aware refinement by selectively emphasizing informative feature components and enhancing weak responses, which is particularly beneficial for preserving subtle changes such as edges, textures, and small structural variations.
As illustrated in Figure 4, RFRB operates on adjacent feature maps from different semantic levels. Following the same feature alignment strategy as in ACGF, the higher-level feature $F_h$ is first upsampled to match the spatial resolution of the lower-level feature $F_l$ and then aggregated to form a base representation $F_b$. Based on $F_b$, RFRB derives adaptive channel-wise responses to guide subsequent feature refinement.
To obtain channel-wise importance cues, global average pooling followed by a Sigmoid activation is applied to $F_b$, yielding a channel-wise response vector:

$$\mathbf{w} = \sigma\big(\mathrm{GAP}(F_b)\big),$$

where $\mathrm{GAP}(\cdot)$ denotes global average pooling along the spatial dimensions and $\mathbf{w}$ represents the channel-wise response strength used to guide subsequent response-aware refinement.
In addition, channel response maps $A_h$ and $A_l$ are obtained from the aligned inter-layer features through batch normalization and Sigmoid activation, characterizing their relative activation strengths. Guided by these responses, feature components are implicitly separated into strong-response and weak-response parts and refined using different processing paths.
Highly responsive feature components are regarded as strong responses and refined using lightweight point-wise convolution to preserve their discriminative semantic information. In contrast, weakly activated components are selectively enhanced using depthwise separable convolution to strengthen local representations while reducing computational overhead. To further regulate weak feature responses, a lightweight channel-wise gating mechanism is introduced. Specifically, the refinement process of weak features can be formulated as follows:

$$\hat{F}_{\mathrm{weak}} = \mathrm{DSConv}\big(F_{\mathrm{weak}}\big) \odot \mathbf{g}.$$

Here, $F_{\mathrm{weak}}$ denotes the feature components exhibiting relatively weak responses and selected for refinement. The operator $\mathrm{DSConv}(\cdot)$ applies depthwise separable convolution to provide lightweight local enhancement, while the Softmax-based gating term $\mathbf{g}$ assigns channel-wise weights to emphasize informative weak features.
The refined strong- and weak-response features, denoted $\hat{F}_{\mathrm{strong}}$ and $\hat{F}_{\mathrm{weak}}$, are then combined via element-wise addition, yielding the output of RFRB:

$$F_{\mathrm{RFRB}} = \hat{F}_{\mathrm{strong}} + \hat{F}_{\mathrm{weak}}.$$
The RFRB leverages adaptive channel-wise weighting and response-aware refinement, together with efficient feature fusion, to strengthen fine-grained change cues and improve multi-scale consistency, while incurring minimal additional computational overhead for enhanced change detection performance.
2.2.4. Prediction Head and Change Map Generation
The final fused feature $F_{\mathrm{out}}$ has the highest spatial resolution among all feature representations after fusion and refinement, retaining multi-level change information while preserving fine-grained spatial details. A lightweight prediction head followed by upsampling to the input resolution is then applied to perform pixel-wise change inference:

$$P = \sigma\Big(\mathrm{Up}\big(\mathcal{H}(F_{\mathrm{out}})\big)\Big),$$

where $\mathcal{H}(\cdot)$ denotes the prediction head implemented as a final convolution layer, $\mathrm{Up}(\cdot)$ represents bilinear upsampling, and applying the Sigmoid function $\sigma(\cdot)$ yields the final change probability map $P$.
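The last two steps of the prediction head, bilinear upsampling followed by a Sigmoid, can be illustrated without any deep learning library. The sketch below assumes that the convolutional head has already produced a single-channel logit map, and it uses a corner-aligned interpolation convention, which is only one of several possible choices and is not specified in the text.

```python
import math


def bilinear_upsample(grid, out_h, out_w):
    """Bilinear interpolation of a 2-D list, aligning the corner samples."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
            bottom = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


def change_probability(logits, out_h, out_w):
    """Upsample the logit map, then apply the Sigmoid element-wise."""
    upsampled = bilinear_upsample(logits, out_h, out_w)
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in upsampled]


# Toy 2x2 logit map upsampled to 4x4.
logits = [[-2.0, 0.0], [0.0, 2.0]]
prob = change_probability(logits, 4, 4)
for row in prob:
    print(" ".join(f"{p:.3f}" for p in row))
```

Applying the Sigmoid after upsampling, as in the formulation above, interpolates in logit space; reversing the two steps would interpolate probabilities instead and generally gives slightly different values.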
3. Experimental Setup
A clarification regarding the definition of the term baseline is necessary, as it is used in two different contexts in this work. In the context of ablation studies, the baseline refers to a VGG16-BN backbone in which the proposed MP-Blocks are removed and replaced with standard convolutional layers. This configuration serves as a controlled reference to isolate the performance gains introduced by the proposed modules. In contrast, in quantitative benchmark comparisons, the term refers to external state-of-the-art methods against which CLFF is evaluated.
3.1. Datasets
To rigorously demonstrate the effectiveness of the proposed CLFF framework, particularly its capability to coordinate semantic and structural information, we employ four widely used public benchmark datasets from the literature, namely LEVIR-CD [29], WHU-CD [30], SYSU-CD [31], and HRCUS-CD [32]. These datasets are selected for their complementary characteristics in terms of spatial resolution, scene composition, and change complexity, thereby enabling a comprehensive evaluation of high-resolution remote sensing change detection performance across a wide range of urban scenarios.
Specifically, LEVIR-CD focuses on building-related changes with varying object scales under relatively structured urban layouts, making it suitable for evaluating scale-sensitive change detection performance. WHU-CD consists of very high-resolution aerial imagery with dense urban scenes and complex building boundaries, posing challenges for fine-grained structural delineation under varying illumination and occlusion conditions. In contrast, SYSU-CD and HRCUS-CD represent more complex urban environments with a broader range of change categories and higher scene heterogeneity, thereby emphasizing robustness to semantic ambiguity and cluttered backgrounds. Collectively, these datasets enable systematic evaluation across object- and scene-level change characteristics.
3.1.1. LEVIR-CD
Released by Beihang University in 2020, the LEVIR-CD [29] dataset has become a widely used benchmark for urban building change analysis in high-resolution remote sensing. It includes 637 bi-temporal RGB image pairs with a ground sampling distance of 0.5 m and a native image size of 1024 × 1024 pixels. Each pair is accompanied by a binary change annotation indicating building-related structural modifications. The dataset is split into 445 training, 64 validation, and 128 test samples. For efficient model training, all images are further divided into non-overlapping patches of 256 × 256 pixels, i.e., 16 patches per image, yielding 10,192 patch pairs in total (7120 for training, 1024 for validation, and 2048 for testing). The samples are collected from multiple metropolitan areas across Texas, USA.
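As a quick arithmetic check of the tiling, the snippet below derives the number of non-overlapping 256 × 256 patches per split from the image counts and sizes stated above. The helper name `patches_per_image` is illustrative.

```python
def patches_per_image(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


# Image counts per split as stated for LEVIR-CD.
splits = {"train": 445, "val": 64, "test": 128}
per_image = patches_per_image(1024, 256)
patch_counts = {name: count * per_image for name, count in splits.items()}
total = sum(patch_counts.values())

print("patches per image:", per_image)
print("patches per split:", patch_counts)
print("total patch pairs:", total)
```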
3.1.2. WHU-CD
We also employ the WHU-CD [30] dataset, developed by Wuhan University using high-resolution aerial imagery (0.3 m/pixel) of Christchurch, New Zealand, acquired in 2012 and 2016. The dataset provides annotated building change labels for pre- and post-earthquake urban analysis. The original images are cropped into patches and split into 6096 training pairs, 762 validation pairs, and 762 testing pairs.
3.1.3. SYSU-CD
The SYSU-CD [31] dataset was developed to support large-scale change detection in remote sensing applications. It consists of 20,000 bi-temporal RGB image pairs acquired between 2007 and 2014 in Hong Kong, covering seasonal and long-term urban changes. The images have a GSD of about 0.5 m and a spatial size of 256 × 256 pixels. A wide range of urban change patterns is covered, such as building construction and demolition, vegetation changes, road modifications, and changes in maritime facilities. The dataset is split into 12,000 pairs for training, 4000 pairs for validation, and 4000 pairs for testing.
3.1.4. HRCUS-CD
HRCUS-CD [32], the High-Resolution Complex Urban Scene Change Detection dataset, was published in 2023 by Zhang et al. and mainly covers the urban area of Zhuhai, China. It contains 11,388 pairs of bi-temporal image patches of 256 × 256 pixels with a GSD (ground sample distance) of 0.5 m. Every pair is annotated with a ground-truth binary change mask, and more than 12,000 building-related changes are marked in total. The data span two time periods: urban areas were imaged in 2019 and 2022, whereas rural and mountainous areas cover an earlier time frame between 2010 and 2018, capturing diverse urban expansion patterns across different time periods.
3.2. Evaluation Metrics
To assess the performance of the proposed model in remote sensing change detection, five commonly used quantitative metrics are adopted, including Precision (Pre), Recall, F1-score, Intersection over Union (IoU), and Pixel Accuracy (PA). Among these metrics, Precision and Recall jointly characterize the model’s ability to detect changes, while F1-score and IoU provide a balanced evaluation by simultaneously considering detection accuracy and spatial localization quality. Pixel Accuracy further reflects the overall classification correctness at the pixel level. The formal definitions of these metrics are given as follows:

$$\mathrm{Pre} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Recall}}{\mathrm{Pre} + \mathrm{Recall}}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN},$$

$$\mathrm{PA} = \frac{TP + TN}{TP + TN + FP + FN}.$$

In this evaluation setting, True Positives ($TP$) denote pixels that have undergone changes and are correctly identified by the model, whereas False Positives ($FP$) refer to unchanged pixels that are mistakenly classified as changed. Pixels that do not change and are correctly classified belong to the True Negative ($TN$) category. On the other hand, False Negatives ($FN$) represent the changed pixels that were not detected.
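The five metrics can be computed directly from the pixel-level confusion counts of the change class. The following helper is a minimal sketch using the standard definitions; returning 0.0 for an empty denominator is a convention chosen here for illustration, and the example counts are made up.

```python
def change_detection_metrics(tp, fp, tn, fn):
    """Compute Precision, Recall, F1, IoU and Pixel Accuracy from pixel counts."""

    def safe_div(numerator, denominator):
        # Convention for this sketch: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    iou = safe_div(tp, tp + fp + fn)
    pixel_accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "iou": iou,
        "pa": pixel_accuracy,
    }


# Hypothetical counts for a 1000-pixel image.
metrics = change_detection_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that Pixel Accuracy is dominated by the unchanged class in this example (0.96), while IoU on the change class is much lower, which is why F1 and IoU are the more informative scores when changed pixels are rare.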
3.3. Implementation Details
To improve readability and reproducibility, the model configuration and key implementation settings are summarized in Table 1.
For quantitative comparison, competing methods are evaluated using their official implementations whenever available; otherwise, they are re-implemented according to the descriptions in the original papers. All methods are trained and tested under identical data splits, input resolutions, and evaluation protocols to ensure a fair comparison. We use AdamW [33] to optimize the model, and a two-stage learning rate schedule is adopted to ensure stable training.
4. Results
In this section, we present comprehensive experimental results to evaluate the effectiveness of the proposed CLFF for very high-resolution remote sensing change detection. Quantitative comparisons with state-of-the-art methods are first reported, followed by qualitative analyses to visually assess detection performance in complex scenes. Extensive ablation studies are then conducted to investigate the contributions of the MP-Block design, the RFRB component, the lightweight convolution strategy (DWConv + CWG), and different backbone networks. Finally, an efficiency analysis is conducted to evaluate the computational cost and practical applicability of the proposed framework.
4.1. Quantitative Comparison with State-of-the-Art Methods
To evaluate the actual performance of the proposed CLFF framework, we conduct quantitative comparisons with several representative remote sensing change detection methods. The compared approaches broadly fall into three categories, including CNN-based methods, attention-enhanced architectures, and Transformer-driven or global–local hybrid designs. To ensure the fairness and consistency of the comparison, all methods are evaluated under identical experimental settings and using the same evaluation metrics.
- FC-EF [34] uses an early fusion strategy, where bi-temporal images are concatenated at the input stage and processed by a single fully convolutional encoder–decoder network.
- FC-Siam-Conc [34] is based on a Siamese architecture consisting of two parallel encoders. Features from corresponding stages are concatenated during decoding to enable temporal feature integration.
- FC-Siam-Diff [34] also adopts a parameter-sharing Siamese structure, in which change information is highlighted by computing the absolute difference between paired feature representations before decoding.
- IFNet [25] is a two-branch CNN-based network that enhances feature difference learning through deep supervision to better discriminate changed and unchanged regions.
- STANet [29] is an attention-based Siamese network that integrates spatiotemporal attention mechanisms to emphasize change regions and suppress irrelevant variations.
- BIT [18] introduces a Transformer-based paradigm into remote sensing change detection, enabling global feature interaction across bi-temporal representations through self-attention mechanisms.
- ChangeFormer [19] is a hierarchical Transformer-based method that captures multi-scale contextual information for accurate change detection.
- CGNet [26] is an attention-enhanced Siamese architecture that utilizes change magnitude-guided attention to focus on significant change regions.
- B2CNet [35] is a progressive refinement network that improves change boundary localization by gradually refining attention from coarse to fine regions.
- HyRet-Change [36] is a hybrid retention-based framework for global–local modeling in remote sensing change detection.
Quantitative results on the four datasets (LEVIR-CD, WHU-CD, SYSU-CD, and HRCUS-CD) are reported in Table 2 and Table 3. The best and second-best results are highlighted in bold.
The experimental results demonstrate that CLFF maintains robust and stable performance across datasets with varying spatial resolutions and scene complexities. On the LEVIR-CD dataset, characterized by densely distributed buildings and subtle structural variations, CLFF achieves the highest F1-score (91.89%) and IoU (84.99%) among all competing methods. This superiority indicates that the proposed cross-layer interaction mechanism effectively retains small-scale structural information often lost in deep layers, thereby enabling the accurate detection of subtle changes while preserving boundary integrity.
A similar advantage is observed on the WHU-CD dataset, where CLFF effectively handles large building footprints and complex urban layouts, ranking first across all four evaluation metrics (F1: 94.12%; IoU: 88.89%). This leading performance validates the efficacy of the structural refinement mechanism, which sharpens object boundaries and significantly reduces edge-related false alarms common in very high-resolution imagery.
On datasets with more diverse object categories, such as SYSU-CD and HRCUS-CD, CLFF continues to exhibit strong generalization capability. Notably, on the challenging HRCUS-CD dataset, CLFF attains an F1-score of 76.40% and an IoU of 61.81%, outperforming the second-best method by margins of 2.21% and 2.71%, respectively. These consistent improvements confirm the framework’s ability to mitigate semantic–structural mismatch, ensuring robust feature alignment even in complex scenes with significant appearance variations.
While CLFF delivers balanced F1 and IoU scores, it does not strictly dominate all metrics. For instance, on LEVIR-CD and HRCUS-CD, STANet [29] attains higher Recall due to a more aggressive detection strategy, albeit at the cost of reduced Precision. In contrast, CLFF maintains a robust trade-off between Precision and Recall. In practice, model selection depends on specific error tolerances: exhaustive discovery tasks may prioritize Recall, whereas automated monitoring typically favors Precision to minimize verification costs. The consistent F1 gains across diverse scenes confirm that CLFF offers a versatile solution that effectively balances these competing requirements.
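When F1 and IoU are computed from the same confusion counts of the change class, they are tied by the identity IoU = F1 / (2 − F1), which follows from IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN). The short check below applies this identity to the F1 and IoU values quoted above for CLFF; it assumes that both scores were obtained from dataset-level counts, so small deviations due to rounding are expected.

```python
def iou_from_f1(f1):
    """Convert an F1-score in [0, 1] to the corresponding IoU."""
    return f1 / (2.0 - f1)


# (F1 %, IoU %) pairs quoted in the text for CLFF.
reported = {
    "LEVIR-CD": (91.89, 84.99),
    "WHU-CD": (94.12, 88.89),
    "HRCUS-CD": (76.40, 61.81),
}

deviations = {}
for name, (f1_percent, iou_percent) in reported.items():
    derived = 100.0 * iou_from_f1(f1_percent / 100.0)
    deviations[name] = abs(derived - iou_percent)
    print(f"{name}: reported IoU {iou_percent:.2f}, derived from F1 {derived:.2f}")
```

The derived values agree with the reported IoU scores to within rounding, which also illustrates why the F1 and IoU rankings of the compared methods move together.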
4.2. Qualitative Comparison
This subsection presents qualitative comparisons on four public benchmark datasets to visually assess the effectiveness of the proposed method under diverse and challenging scenarios. These qualitative results further demonstrate that CLFF reduces semantic–structural mismatch and suppresses appearance-induced pseudo-changes across a wide range of challenging conditions.
4.2.1. LEVIR-CD Comparison
Figure 5 presents qualitative results on the LEVIR-CD dataset, which is characterized by significant variations in building scale and illumination conditions. As shown in Figure 5(1)–(3), baseline methods tend to produce fragmented predictions when dealing with large-scale buildings or densely distributed small objects. In the cluttered urban scene of Figure 5(2), they also miss weak change targets because of interference from surrounding structures.
Whereas other approaches often produce incomplete or morphologically inconsistent change maps, CLFF yields more coherent predictions by explicitly coordinating cross-layer information. Through the integration of high-level semantic representations with fine-grained structural details, the proposed framework enhances spatial detail recovery in the segmentation results. For instance, in the dense residential area shown in Figure 5(4), CLFF preserves clear separation between adjacent buildings and maintains boundary integrity, while competing methods tend to generate blurred contours or irregularly merged regions. These results suggest that the proposed interaction mechanism effectively captures long-range structural dependencies and maintains structural coherence at larger spatial scales.
4.2.2. WHU-CD Comparison
Figure 6 shows qualitative comparisons on the WHU-CD dataset, a benchmark of ultra-high-resolution urban scenes that are frequently affected by dense vegetation occlusion and transient illumination changes. As shown in Figure 6(1),(2), many existing methods are prone to false positives. These errors are typically triggered by phenological vegetation changes or moving cast shadows, and failing to suppress them leads to misclassification along building boundaries.
By explicitly separating structural changes from semantically irrelevant appearance variations, CLFF suppresses these false detections. For large building footprints with homogeneous textures, CLFF produces more coherent and better-defined boundaries than conventional methods, as illustrated in Figure 6(4), and avoids the fragmented contours and serrated edge artifacts that competing methods commonly produce. These qualitative observations indicate that, even under substantial radiometric variations, the coordinated cross-layer fusion mechanism improves spatial consistency.
4.2.3. SYSU-CD Comparison
Figure 7 presents qualitative results on the SYSU-CD benchmark. This dataset is particularly suitable for qualitative evaluation due to the presence of diverse change categories and numerous long linear structures. As shown in Figure 7(1), when small targets are embedded within complex backgrounds, competing methods are prone to either miss the targets or produce fragmented and disconnected change maps. In contrast, CLFF captures fine structural dissimilarities across different backgrounds while filtering out background interference.
A notable advantage of the proposed CLFF emerges when handling the evolution of linear infrastructures, such as road expansions or wharf developments (Figure 7(2)–(4)). Whereas baseline architectures tend to output discontinuous contours mainly due to their limited contextual receptive fields, CLFF maintains strong morphological consistency and geometric continuity. These empirical observations indicate that the proposed cross-layer interaction scheme effectively coordinates global contextual perception with local structural information, resulting in more coherent and structurally consistent change representations—an ability that is critical for modeling elongated man-made structures. Moreover, under dynamic fluctuations of water surfaces (Figure 7(5)), CLFF demonstrates enhanced robustness to irregular background variations.
4.2.4. HRCUS-CD Comparison
Figure 8 illustrates the qualitative performance of CLFF on the HRCUS-CD dataset, which is characterized by highly cluttered scenes and fine-grained changes. In scenarios involving building construction within complex surroundings (Figure 8(2)) or appearance changes induced by roof repainting (Figure 8(5)), many state-of-the-art methods suffer from pronounced missed detections or false alarms.
In contrast, CLFF produces more plausible and internally consistent predictions with sharper and more uniform boundaries. Even when target regions are partially embedded in vegetation or affected by strong shadow interference (Figure 8(4)), the proposed method reliably identifies the change regions.
4.3. Ablation Study
Unless otherwise stated, the baseline configuration in all ablation studies refers to the same backbone network where the proposed MP-Block is replaced by a standard convolutional unit (i.e., Conv with BN and ReLU). For simplicity, this unit is abbreviated as “ Conv” in some tables.
4.3.1. Ablation on Stage-Wise Placement of MP-Blocks
To assess the effectiveness of the proposed cross-layer interaction design, we carry out stage-wise ablation studies by inserting the MP-Block at different stages of the backbone network, either individually or in combination. The configurations of the variants are listed below:
- Baseline: VGG16-BN backbone where all MP-Blocks are removed and replaced with standard Conv + BN + ReLU layers.
- Baseline + MP-3: MP-3 enabled.
- Baseline + MP-2: MP-2 enabled.
- Baseline + MP-1: MP-1 enabled.
- Baseline + MP-3 + MP-2: MP-3 and MP-2 enabled.
- Baseline + MP-3 + MP-1: MP-3 and MP-1 enabled.
- Baseline + MP-2 + MP-1: MP-2 and MP-1 enabled.
- CLFF (Full): MP-1, MP-2, and MP-3 all enabled.
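The eight variants above correspond to every on/off combination of the three MP-Block stages. A minimal sketch of how such an ablation grid could be enumerated is given below; the function name `ablation_variants` and the dictionary layout are hypothetical conveniences, not part of the authors' code.

```python
from itertools import combinations

# Stage labels as used in the variant list (deepest stage first).
STAGES = ("MP-3", "MP-2", "MP-1")


def ablation_variants(stages=STAGES):
    """Return (name, flags) pairs for every subset of enabled MP-Block stages.

    A disabled stage is understood to fall back to the standard
    Conv + BN + ReLU unit of the baseline.
    """
    variants = []
    for k in range(len(stages) + 1):
        for subset in combinations(stages, k):
            if not subset:
                name = "Baseline"
            elif len(subset) == len(stages):
                name = "CLFF (Full)"
            else:
                name = "Baseline + " + " + ".join(subset)
            flags = {stage: stage in subset for stage in stages}
            variants.append((name, flags))
    return variants


for name, flags in ablation_variants():
    enabled = [stage for stage, on in flags.items() if on]
    print(f"{name:26s} enabled: {enabled}")
```

Enumerating the grid programmatically makes it easy to confirm that no configuration is skipped and that each run differs from the baseline only in the stages it enables.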
Tables 4 and 5 show that inserting an MP-Block at any single stage consistently improves performance over the baseline on all datasets, indicating that cross-layer interactions at different semantic levels benefit change representation. Even a single MP-Block yields gains, which suggests that it strengthens feature interaction at the stage where it is inserted. Dual-stage configurations further improve F1-score and IoU compared with single-stage variants. The full CLFF, which deploys MP-Blocks at all three stages, achieves the best overall performance, demonstrating that multi-level collaboration is critical for simultaneously preserving semantic cues and fine-grained structures.
Some single- or dual-stage variants achieve slightly higher Precision. The full CLFF model, however, is better balanced: it attains high Recall while retaining competitive Precision, and therefore obtains the best overall scores. These results indicate that enabling cross-layer interaction at every stage yields more complete change detection and greater overall reliability.
4.3.2. Visual Ablation of MP-Blocks at Different Feature Levels on the WHU-CD Dataset
In addition to the quantitative analysis above, Figure 9 shows qualitative ablation results on the WHU-CD dataset, illustrating how MP-Blocks at different feature levels affect the predictions.
As shown in Figure 9, the baseline predictions are fragmented, with missing regions and broken contours, and contain obvious background noise for both small objects and large structures. Introducing an MP-Block at a single feature level already improves structural integrity, and enabling MP-Blocks at multiple levels further improves spatial coherence and reduces background noise.
As more MP-Blocks are enabled, the predicted change maps become increasingly regular, and the full CLFF model produces the most complete shapes and sharpest edges across scenes ranging from dense neighborhoods to complex rooftops. These findings show that multi-level collaboration is important for robust change detection in large urban scenes.
4.3.3. Ablation Study on the Internal Components of the MP-Block
To disentangle the individual contributions of the internal components within the MP-Block, a component-wise ablation study was performed on the WHU-CD dataset, with the quantitative results summarized in Table 6. The baseline configuration adopted a standard convolutional fusion layer, and individual components were subsequently introduced in an incremental manner.
Replacing the standard convolutional operation with the RFRB raises the F1-score from 92.54% to 93.32% and the IoU from 86.11% to 87.48%. This gain indicates that the response-aware refinement mechanism strengthens structural representations and supports a more precise delineation of object boundaries. Using the ACGF unit alone yields an F1-score of 93.69% and an IoU of 88.13%, accompanied by a clear improvement in Precision. This suggests that adaptive channel-group fusion filters redundant channel-wise responses, suppressing spurious activations and reducing false-positive predictions. Together, these results show that both RFRB and ACGF contribute to sharper structural delineation and more discriminative features.
When RFRB and ACGF are integrated within MIFM, the model achieves improved performance, indicating that structural refinement and channel recalibration are complementary. This suggests that MIFM serves as the core fusion module where semantic and structural information are jointly coordinated across feature levels.
The best overall performance is obtained when all MP-Blocks are enabled and the MIFM is complemented by the PAM, which is incorporated as an auxiliary global enhancement module alongside the proposed cross-layer fusion strategy. With the inclusion of PAM, the model attains the highest F1-score of 94.12% and an IoU of 88.89%, primarily driven by a substantial improvement in Recall. Overall, this hierarchical analysis demonstrates that the major performance gains stem from the MIFM-based cross-layer interaction, while PAM provides additional global contextual enhancement.
4.3.4. Ablation Analysis of Weak Feature Enhancement in the RFRB
This analysis examines how alternative designs of the RFRB enhancement branch affect the refinement of weak responses in complex scenes.
To evaluate the effectiveness of the weak feature enhancement strategy in the Response-aware Feature Refinement Block (RFRB), we conducted a series of ablation experiments by replacing the enhancement branch with different convolutional and gating configurations, including convolution, convolution, depthwise separable convolution (DWConv) [37], group convolution (GConv, ) [38], and the proposed DWConv with channel-wise gating (CWG). The quantitative results on the WHU-CD dataset are summarized in Table 7.
Among the compared variants, the convolution achieved the highest F1 score (94.21%) and IoU (89.05%), yet this gain was accompanied by a noticeable increase in model complexity, with 30.37 M parameters and 62.85 G FLOPs. By comparison, DWConv and GConv markedly reduced computational cost (27.27 M/61.20 G and 27.69 M/62.17 G), but the associated performance gains remained limited. This indicates that simply switching to lighter convolutional filters is not sufficient to enhance weak change responses.
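The cost differences among these convolution types follow from standard weight-count formulas, sketched below for a single layer. The channel width (256), kernel size (3) and group count (4) are assumptions chosen purely for illustration; they are not stated in this section and are not claimed to match the configurations behind Table 7. Bias terms are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weights of a k x k convolution; groups > 1 gives a group convolution."""
    assert c_in % groups == 0 and c_out % groups == 0
    return k * k * (c_in // groups) * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Assumed illustrative sizes (not taken from the paper).
C_IN, C_OUT, K, GROUPS = 256, 256, 3, 4

standard = conv_params(C_IN, C_OUT, K)
grouped = conv_params(C_IN, C_OUT, K, groups=GROUPS)
separable = depthwise_separable_params(C_IN, C_OUT, K)

print(f"standard conv:       {standard:>8d} weights")
print(f"group conv (g={GROUPS}):    {grouped:>8d} weights")
print(f"depthwise separable: {separable:>8d} weights")
```

Under these assumed sizes the depthwise separable layer needs roughly an order of magnitude fewer weights than the standard layer, which is consistent in direction with the lower parameter counts reported for DWConv and GConv.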
When channel-wise gating was introduced, the DWConv + CWG (Softmax) variant further improved the F1-score to 94.12% and the Recall to 93.94%, while maintaining nearly unchanged computational complexity (27.62 M parameters and 61.20 G FLOPs). This indicates that the improvement comes from rebalancing the existing features rather than from increased model capacity. A comparison between softmax- and sigmoid-based gating functions reveals distinct behaviors. The sigmoid-based gate attains a comparable F1-score (94.04%) with slightly higher Precision (94.69%), whereas the softmax-based gate generally yields higher Recall, which helps to recover complete change regions. This difference can be attributed to the competitive normalization property of softmax, in contrast to the independent channel-wise modulation mechanism employed by sigmoid-based attention [39,40]. Accordingly, the softmax-based CWG is adopted as the default configuration in this work.
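The contrast between competitive and independent gating can be illustrated with a small, generic sketch. It is not the CWG implementation, whose exact formulation is not given in this section; the per-channel scores are hypothetical values standing in for whatever channel descriptor the gate receives.

```python
import math


def softmax_gate(scores):
    """Competitive gating: weights sum to 1, so channels share a fixed budget."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_gate(scores):
    """Independent gating: each weight depends only on its own channel score."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


# Hypothetical channel scores; only channel 0 differs between the two cases.
scores = [2.0, 0.5, -1.0]
boosted = [4.0, 0.5, -1.0]

print("softmax:", [round(w, 4) for w in softmax_gate(scores)],
      "->", [round(w, 4) for w in softmax_gate(boosted)])
print("sigmoid:", [round(w, 4) for w in sigmoid_gate(scores)],
      "->", [round(w, 4) for w in sigmoid_gate(boosted)])
```

Raising the score of channel 0 lowers the softmax weights of the other channels, whereas their sigmoid weights are unchanged. This coupling is the competitive normalization property referred to above.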
Overall, the results show that the proposed weak feature enhancement strategy effectively strengthens subtle change responses in a lightweight and representation-driven manner. In particular, explicitly rebalancing weak and strong features matters more than simply increasing convolutional capacity.
4.3.5. Backbone-Wise Evaluation of the MP-Block
To verify the generalization capability of the proposed method across different network architectures, we evaluated the MP-Block on three representative backbone networks, namely ResNet18 [41], ResNet50 [41], and VGG16 [42]. In practical remote sensing applications, choosing a backbone is often limited by the computational budget, so it is necessary to have an interaction design that is flexible and architecture-agnostic.
Experimental settings for fair comparison. VGG16 is adopted as the primary backbone configuration. For ResNet18, the standard feature outputs are directly utilized without any additional modification. In contrast, ResNet50 produces feature maps with substantially larger channel dimensions, reaching up to 2048 channels at the deepest stage, which differs markedly from VGG16. In order to ensure the fairness of the comparison in terms of computational complexity and parameter efficiency, a channel compression strategy was adopted for ResNet50. Specifically, a convolution was inserted at each stage to reduce the channel dimensions by a factor of two (e.g., from 2048 to 1024 at the deepest layer) before the features were fed into the MP-Block.
Table 8 shows a consistent pattern: the MP-Block remains effective across different backbone networks. Specifically, when applied to the lightweight ResNet18, the module improves both F1-score and IoU while simultaneously reducing parameter count and FLOPs, demonstrating that the proposed cross-layer interaction mechanism maintains strong performance even under strict model capacity constraints. The MP-Block also scales effectively to deeper backbones such as ResNet50, while VGG16 achieves the best overall performance owing to its dense multi-level feature representations.
Consistent performance gains are observed across backbones with different depths and architectural designs, demonstrating that the effectiveness of CLFF is not dependent on a specific encoder. This result indicates that semantic–structural misalignment is a common issue in CNN-based backbones. By introducing hierarchical cross-layer interaction, the MP-Block consistently mitigates this misalignment in both shallow (ResNet18) and deep (ResNet50) networks, confirming its role as a backbone-agnostic structural refinement module.
4.4. Comparison of Model Efficiency
Table 9 compares the computational efficiency of representative change detection models on the WHU-CD dataset with respect to parameter size, FLOPs, and training time per epoch. All experiments were carried out on the same hardware setup to ensure a fair comparison.
As shown in Table 9, the proposed CLFF contains 27.62 M parameters and requires 61.20 G FLOPs, which are substantially lower than those of most VGG16-based methods such as IFNet and CGNet, while still achieving comparable or better detection accuracy. Compared with transformer-based methods such as ChangeFormer, CLFF exhibits a much lower computational cost and is therefore more suitable for resource-constrained scenarios.
The training time per epoch for CLFF is 91.45 s, which is slightly longer than that of some lightweight CNN-based models but significantly shorter than that of heavy Transformer-based approaches. This indicates that the proposed cross-layer fusion design introduces only limited additional overhead while delivering a clear performance improvement.
To conclude, CLFF achieves a favorable balance between accuracy and efficiency. Its moderate parameter scale and computational cost, together with fast training speed, indicate that the proposed framework is well suited for practical high-resolution remote sensing change detection applications.
5. Discussion
Experiments on four benchmark datasets confirm that explicitly modeling cross-layer interactions is crucial for effective change detection in very high-resolution remote sensing imagery. By enabling direct interaction between high-level semantic representations and fine-grained structural details, CLFF effectively overcomes the semantic–structural separation inherent in hierarchical feature representations. In cluttered scenes, the MP-Block couples noise-sensitive shallow features with semantically robust deep features, allowing for semantic cues to guide structural reconstruction. This design leads to clearer boundaries and more complete extraction of change regions, particularly in complex urban environments.
The behavior of the RFRB also highlights the need for selective structural enhancement. In VHR images, many changes occupy only a small part of the scene and differ little from their surroundings, so they are easily masked by dominant background patterns. Response-aware refinement enhances these subtle change cues and suppresses redundant features, and the ablation results show that this targeted enhancement is more effective than simply increasing convolutional capacity.
Although the results are promising, some limitations remain. In extremely cluttered or occluded scenes, small false detections may still occur, and real-time processing of ultra-high-resolution images remains difficult. These observations motivate further work on improved context modeling, lighter architecture design, and multi-modal information fusion.
6. Conclusions
In this paper, we proposed CLFF, a cross-layer feature fusion framework for change detection in very high-resolution remote sensing images. The core contribution of this work lies in explicitly organizing hierarchical interactions between encoder features through the proposed MP-Block, rather than relying on isolated feature enhancement or implicit fusion. By coordinating semantic abstraction and structural detail across layers, CLFF provides a principled solution to the semantic–structural misalignment commonly observed in very high-resolution change detection. Comparative and ablation studies on four public VHR datasets consistently verify the effectiveness of CLFF and highlight the essential contribution of explicit feature fusion to performance improvement.
Despite its effectiveness, the proposed method still exhibits certain limitations. In extremely cluttered or heavily occluded scenarios, small false detections may still persist, and processing ultra-high-resolution imagery remains computationally demanding for real-time deployment. These limitations suggest that there remains room for further improvement in both robustness and efficiency.
Future work will focus on developing more lightweight architectural variants and extending the proposed interaction mechanism to more challenging settings, such as multi-modal fusion and cross-domain adaptation, with the goal of improving generalization capability and deployment practicality under real-world constraints.
References
- [1] Li, H.L.; Chen, S.W. General polarimetric correlation pattern: A visualization and characterization tool for target joint-domain scattering mechanisms investigation. IEEE Trans. Geosci. Remote Sens. 2026, 64, 5200417. doi:10.1109/tgrs.2025.3647123
- [2] Li, H.L.; Chen, S.W. Polyhedral corner reflectors multidomain joint characterization with fully polarimetric radar. IEEE Trans. Antennas Propag. 2025, 73, 10679–10693. doi:10.1109/tap.2025.3608033
- [3] Li, H.L.; Liu, S.W.; Chen, S.W. PolSAR ship characterization and robust detection at different grazing angles with polarimetric roll-invariant features. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5225818. doi:10.1109/TGRS.2024.3474702
- [4] Singh, A. Review article: Digital change detection techniques using remotely sensed data. Int. J. Remote Sens. 1989, 10, 989–1003. doi:10.1080/01431168908903939
- [5] Gao, F.; Wang, X.; Gao, Y.; Dong, J.; Wang, S. Sea ice change detection in SAR images based on convolutional-wavelet neural networks. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1240–1244. doi:10.1109/LGRS.2019.2895656
- [6] Mishra, P.K.; Rai, A.; Rai, S.C. Land use and land cover change detection using geospatial techniques in the Sikkim Himalaya, India. Egypt. J. Remote Sens. Space Sci. 2020, 23, 133–143. doi:10.1016/j.ejrs.2019.02.001
- [7] Brunner, D.; Bruzzone, L.; Lemoine, G. Change detection for earthquake damage assessment in built-up areas using very high resolution optical and SAR imagery. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Honolulu, HI, USA, 25–30 July 2010; pp. 3210–3213. doi:10.1109/IGARSS.2010.5651416
- [8] Lei, T.; Xu, Y.; Ning, H.; Lv, Z.; Min, C.; Jin, Y.; Nandi, A.K. Lightweight structure-aware transformer network for remote sensing image change detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6000305. doi:10.1109/LGRS.2023.3323534
