A Multi-Modal Decision-Level Fusion Framework for Hypervelocity Impact Damage Classification in Spacecraft

Kuo Zhang; Chun Yin; Pengju Kuang; Xuegang Huang; Xiao Peng

PMC · DOI:10.3390/s26030969·February 2, 2026

A Multi-Modal Decision-Level Fusion Framework for Hypervelocity Impact Damage Classification in Spacecraft

Kuo Zhang, Chun Yin, Pengju Kuang, Xuegang Huang, Xiao Peng

PDF

Open Access

TL;DR

This paper introduces a new framework that combines multiple data sources to accurately detect and classify damage on spacecraft from high-speed impacts.

Contribution

The novel framework integrates infrared and vibration data with a dual-stream ResNet and D-S theory for decision-level fusion in HVI damage classification.

Findings

01

The proposed method achieves 99.01% mean accuracy, outperforming unimodal approaches.

02

It corrects misclassifications in micro-cracks and perforation with over 96.9% precision.

03

The method shows high stability with a standard deviation of 0.74%.

Abstract

During on-orbit service, spacecraft are subjected to hypervelocity impacts (HVIs) from micrometeoroids and space debris, causing diverse damage types that challenge structural health assessment. Unimodal approaches often struggle with similar damage patterns due to mechanical noise and imaging distance variations. To overcome these physical limitations, this study proposes a physics-informed multimodal fusion framework. Innovatively, we integrate a distance-aware infrared enhancement strategy with vibration spectral subtraction to align heterogeneous data qualities while employing a dual-stream ResNet coupled with Dempster–Shafer (D-S) evidence theory to rigorously resolve inter-modal conflicts at the decision level. Experimental results demonstrate that the proposed strategy achieves a mean accuracy of 99.01%, significantly outperforming unimodal baselines (92.96% and 97.11%). Notably,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases2

on-orbit damage Hypervelocity Impact Damage

Figures1

Click any figure to enlarge with its caption.

Overall flowchart of the vibration-infrared multimodal fusion classification method.

Funding2

—National Natural Science Foundation of China
—Sichuan Provincial Science and Technology Support Program

Keywords

hypervelocity impactdamage classificationmultimodal fusiondeep residual networkdecision-level fusion

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSpace Satellite Systems and Control · Structural Health Monitoring Techniques · Rock Mechanics and Modeling

Full text

1. Introduction

During the in-orbit operation of spacecraft, it is impossible to avoid the hypervelocity impact (HVI) caused by micrometeoroids and space debris (MMOD) [1,2,3]. Such impacts typically act on the surface and internal units of structures at extremely high incident speeds and transient kinetic energy, causing a wide variety of damages (such as craters, cracks, perforations, etc.) with a large scale range. Moreover, the development patterns and maintenance strategies of different damage types vary significantly [4,5,6,7]. Therefore, precise detection of HVI damage is fundamental, and the development of classification technologies capable of accurately distinguishing different types of damage has become a key technical requirement for supporting the long-term stable service of spacecraft [8,9]. In this context, detection methods based on vibration response analysis and infrared thermography offer promising technical paths to address these challenges, due to their advantages of non-contact, remote, and full-field coverage [10,11,12,13]. Katam et al. [10] proposed a damage detection method based on STFT time–frequency feature extraction and autoencoder dimensionality reduction, achieving high-accuracy damage classification under limited data conditions through time–frequency analysis of vibration signals. Zhang et al. [11] introduced a vibration-based damage detection approach utilizing phase-based motion estimation and convolutional neural networks, which enables precise localization of bolt looseness damage with single-sample training via pixel-level vibration signal extraction from videos. Gao et al. [12] proposed a hypervelocity impact (HVI) damage detection method based on infrared data extraction and multi-objective optimization algorithms. By extracting representative infrared features and reconstructing defect regions, the method effectively improved detection accuracy and efficiency. Yin et al. [13] introduced the Dynamic Multi-Objective Feature Extraction Optimization (DM-FEO) method, which enhances the precision of HVI damage detection through temperature point extraction in infrared thermography and multi-directional prediction algorithms, successfully distinguishing the thermal characteristics of different damage types. Vibration detection is sensitive to changes in the structural frequency response function, making it effective for identifying local stiffness degradation and delamination damage, as well as macro-scale defects [14]. Recent research has further demonstrated that coupling finite element models with metaheuristic optimization algorithms can significantly enhance the precision of damage localization in composite structures [15]. Wang et al. [16] comprehensively reviewed the physical principles, excitation modalities, and applications of multimode infrared thermal-wave imaging in non-destructive testing. In particular, the integration of active infrared imaging with deep learning techniques has been identified as a highly accurate solution for defect detection in complex industrial components [17]. However, it is important to note that these methods primarily serve the detection and preliminary identification of “damage or no damage.” When the task objective shifts to “high-precision classification” of damage types, the inherent limitations of these unimodal methods become evident.

Accurate damage classification of spacecraft, such as identifying crack, perforation, and delamination, is crucial for supporting precision on-orbit maintenance. This task faces three major challenges: first, different damage types exhibit high feature similarity under a single modality, making them difficult to distinguish accurately; second, vibration signals are susceptible to noise interference while infrared features are influenced by imaging distance, leading to poor model stability—a problem also prominent in complex industrial settings, as Tian et al. [18] demonstrated that variable working conditions hinder stable feature extraction; third, the scarcity of annotated on-orbit samples coupled with class imbalance limits the training effectiveness of data-driven models, prompting Yin et al. [19] to employ generative adversarial networks to enhance scarce sample representation. Thus, relying on a single modality fails to achieve both comprehensive coverage and discriminative robustness, particularly under sample-limited conditions required for high-accuracy classification.

To overcome the limitations of unimodal methods, recent research has shifted toward multimodal data fusion for detection [20,21]. These studies demonstrate that integrating complementary multimodal information is an effective strategy for improving performance in complex tasks. However, existing fusion methods primarily focus on feature-level or data-level fusion, requiring precise alignment of heterogeneous data, and they do not explicitly address the critical issue of uncertainty in classification tasks. As a result, their effectiveness is limited when handling challenges such as distinguishing highly similar damage types in fine-grained classification. Data-driven automatic feature learning and classification provide a new technological path for HVI damage detection [22,23,24]. Models, particularly convolutional neural networks (CNNs) and residual networks (ResNet), can learn discriminative features across scales and forms in an end-to-end manner, alleviating the degradation of deep models and enhancing training stability [25]. Ye et al. [26] developed a multi-sensor residual fusion network utilizing double-ring residual and global interactive modules to achieve robust diagnosis under noisy and small-sample conditions; Le-Xuan et al. [27] constructed a hybrid 1DCNN-LSTM-ResNet architecture that effectively captures long-term temporal dependencies and enhances damage detection accuracy. Yan et al. [28] proposed a QCNN-based method for bearing fault diagnosis by fusing audio and vibration signals for high-precision detection in noisy environments.

Furthermore, the combination of multimodality and deep learning shows even better prospects. Xu et al. [29] proposed a multimodal neural network fusion method that integrates large-kernel networks with an attention mechanism, achieving highly robust damage identification in noisy environments. Meshram et al. [30] designed an integrated model utilizing multimodal transformers and hybrid deep learning to optimize pothole detection and repair in diverse environments. Cao et al. [31] proposed a multimodal feature fusion model combining ResNet and GRU to distinguish pseudo defects in ultra-thick stainless-steel welds using Phased Array Ultrasonic Testing. Peng et al. [32] developed a multimodal hybrid neural network, significantly improving damage classification accuracy and data efficiency under complex operational conditions. Research by Zaman et al. [33] and Chen et al. [34] also demonstrated the advantages of multimodal fusion in fault diagnosis. However, the performance of these methods is highly dependent on the scale and quality of training data. In HVI damage scenarios, real labeled data is scarce, especially for infrared images, where image quality is highly dependent on the shooting distance. Distance uncertainty leads to feature distribution shifts, which has become a major bottleneck in applying deep learning methods.

This study aims to establish a robust multimodal fusion framework for high-precision HVI damage classification. Key objectives include: (1) overcoming unimodal physical limitations, such as vibration insensitivity and thermal blurring; (2) mitigating feature degradation caused by noise and variable imaging distances; and (3) resolving inter-modal conflicts via uncertainty quantification for reliable diagnosis. By integrating physics-informed signal enhancement with evidence-theoretic fusion, this work provides an accurate solution for on-orbit structural health monitoring. As depicted in Figure 1, the proposed framework targets the challenge of high-precision HVI damage classification. Key contributions include:

A vibration-infrared multimodal fusion framework that systematically integrates vibration time–frequency analysis with infrared distance-aware enhancement, providing a robust solution for feature extraction in variable environments.
A dual-stream deep residual network architecture developed to extract and fuse modality-specific features. It combines spectral subtraction and STFT for vibration signals and utilizes hierarchical enhancement to address infrared feature shifts caused by imaging distance variations.
A decision-level fusion method based on D-S evidence theory. By quantifying epistemic uncertainty, this approach suppresses inter-modal conflicts and maximizes complementary advantages, ensuring reliable classification for spacecraft maintenance.

Overall flowchart of the vibration-infrared multimodal fusion classification method.

2. Problem Statement

Spacecraft inevitably encounter hypervelocity impacts (HVIs) from micrometeoroids and orbital debris (MMOD). To ensure survivability, mission requirements extend beyond detection to the precise classification of damage types (e.g., cracks, delamination, or perforation) for targeted maintenance. However, transitioning to high-precision classification faces fundamental challenges of intrinsic ambiguity and feature aliasing. As illustrated in Figure 2, these challenges propagate from the physical perception layer to the decision layer.

First, signal degradation in harsh environments limits sensing reliability. Vibration responses, critical for identifying micro-damage, are often masked by broadband mechanical noise, resulting in a low Signal-to-Noise Ratio (SNR) that renders time-domain analysis ineffective. Similarly, infrared feature representation is highly sensitive to imaging distance. Variations in detection distance cause severe “domain shift” in thermal feature distributions, hindering model generalization under variable operating conditions.

Second, data heterogeneity and sample scarcity create a “semantic gap” in feature learning. Synergizing 1D temporal vibration signals (global stiffness) with 2D thermal images (local flow blockage) requires aligning incompatible geometric manifolds. Furthermore, real-world data exhibits an extreme long-tail distribution: scarce high-risk samples (e.g., perforations) are overwhelmed by abundant minor damage samples. This imbalance biases decision boundaries toward the majority class, leading to critical missed detections.

Finally, sensor conflicts compromise deterministic decision-making. In complex scenarios, a “Paradox State” arises when sensors yield contradictory diagnoses (e.g., internal delamination triggering vibration anomalies while remaining thermally invisible). In such cases, traditional hard-voting mechanisms fail. Therefore, establishing a mathematical framework to quantify “epistemic uncertainty” is imperative to manage conflicts and preserve high-confidence evidence.

To systematically address these challenges—signal degradation (Physical Layer), feature alignment gaps (Feature Layer), and evidential conflicts (Decision Layer)—this paper proposes a multimodal fusion framework (Figure 2, right panel) that achieves robust classification through multi-level processing mechanisms. To satisfy on-orbit real-time constraints, the framework prioritizes computational efficiency through a lightweight design. By adopting decision-level fusion over high-dimensional feature concatenation, computational overhead is minimized. The decoupled dual-stream topology supports hardware parallelism, ensuring system latency is determined by the longest single branch rather than cumulative processing time. Additionally, input dimension optimization reduces floating-point operations (FLOPs), guaranteeing rapid response capabilities for sudden impacts.

3. Methodology Description

3.1. Vibration Signal Enhancement and Time–Frequency Representation

Under ultrasonic broadband excitation, vibrational responses are critical for damage identification but are often compromised by low Signal-to-Noise Ratios ( $[eqn]$ ) due to environmental interference. To recover transient damage features, a spectral subtraction algorithm is employed. Crucially, this method suppresses additive noise while preserving original phase information, ensuring the temporal integrity required for accurate time–frequency analysis. The implementation principle is illustrated in Figure 3.

The observed signal $[eqn]$ is modeled as a superposition of the true system response $[eqn]$ and additive noise $[eqn]$ :

[eqn]

where N is the signal length. Applying the Short-Time Fourier Transform (STFT) transforms the signal into the frequency domain:

[eqn]

Here, Z, $[eqn]$ , and V correspond to the frequency representations of the noisy signal, true response, and noise, respectively. These components are decomposed into magnitude and phase:

[eqn]

[eqn]

where $[eqn]$ denotes magnitude and $[eqn]$ denotes phase.

To recover the damage-related features, Spectral Subtraction is employed. The noise magnitude spectrum $[eqn]$ is estimated from baseline non-excited signals. The denoised amplitude spectrum $[eqn]$ is then obtained by subtracting the noise floor:

[eqn]

The time-domain signal $[eqn]$ is reconstructed via the Inverse STFT (ISTFT) by combining the denoised magnitude with the original noisy phase $[eqn]$ , preserving the temporal characteristics of the impact:

[eqn]

To standardize inputs for deep learning and expand the dataset, an overlapping sliding window strategy is applied to the denoised sequence $[eqn]$ . Given a window length $[eqn]$ and step size $[eqn]$ , the overlap ratio $[eqn]$ is defined as:

[eqn]

The l-th signal segment $[eqn]$ is extracted as follows:

[eqn]

The total number of segments $[eqn]$ is calculated by:

[eqn]

In this study, parameters are set to $[eqn]$ and $[eqn]$ ( $[eqn]$ ) to capture local features while ensuring continuity. Each segment is assigned a damage label $[eqn]$ , constructing the labeled vibration dataset:

[eqn]

Finally, $[eqn]$ is partitioned into training and testing sets (8:2 ratio) using stratified sampling to maintain consistent class distribution and prevent overfitting.

3.2. Distance-Aware Data Augmentation for Infrared Images

Infrared imaging quality in spacecraft NDT degrades significantly with varying detection distances. To address this, a distance-aware data augmentation strategy is proposed to physically simulate degradation across near-, mid-, and far-field regions. This approach employs hierarchical enhancement functions $[eqn]$ to model distinct degradation patterns: $[eqn]$ enhances high-resolution detail separability; $[eqn]$ applies geometric transformations for inter-class discriminability; and $[eqn]$ compensates for resolution loss and noise.

The enhancement logic is formally defined in Equation (11), where d denotes the simulated distance, and $[eqn]$ represent the near- and far-field thresholds, respectively.

[eqn]

The complete data generation process is summarized in Algorithm 1. By generating multi-distance samples consistent with physical variations, this method effectively improves model robustness under diverse imaging conditions. Algorithm 1 Distance-Aware Infrared Data Augmentation

Require: Raw infrared images $[eqn]$ , Augmentation factor $[eqn]$ , Thresholds $[eqn]$
Ensure: Augmented dataset $[eqn]$ , split into $[eqn]$
1:Initialize $[eqn]$
2:for each image $[eqn]$ do
3: for $[eqn]$ to $[eqn]$ do
4: Generate simulated distance $[eqn]$
5: Determine Category $[eqn]$ based on Equation (11):
6: if $[eqn]$ then $[eqn]$
7: else if $[eqn]$ then $[eqn]$
8: else $[eqn]$
9: end if
10: Apply Transformation: $[eqn]$
11: $[eqn]$
12: end for
13:end for
14:Partition: Split $[eqn]$ into $[eqn]$ (80%) and $[eqn]$ (20%) using stratified sampling
15:return $[eqn]$

3.3. Multi-Modal Feature Extraction via a Dual-Stream ResNet

Building on the preprocessed datasets $[eqn]$ and $[eqn]$ , we design a dual-stream enhanced ResNet to extract hierarchical features. ResNet-50 is selected as the backbone for its optimal balance between feature abstraction and convergence efficiency under limited data. Its residual learning framework effectively mitigates gradient vanishing by reformulating the mapping $[eqn]$ as learning a residual function $[eqn]$ :

[eqn]

where x and $[eqn]$ denote the input and trainable parameters, respectively. As shown in Figure 4, the architecture employs two parallel, weight-independent streams to preserve modality-specific representations.

To enhance feature discriminability, a Channel Attention Mechanism is integrated into the ResNet Bottleneck. For an input feature map $[eqn]$ , attention weights o are computed via a dual-path pooling strategy comprising Global Average Pooling (GAP) and Global Max Pooling (GMP):

[eqn]

[eqn]

[eqn]

Here, $[eqn]$ and $[eqn]$ are learned weights with a reduction ratio $[eqn]$ , while $[eqn]$ and $[eqn]$ denote ReLU and Sigmoid functions. The feature map is adaptively refined via element-wise multiplication:

[eqn]

Through this element-wise multiplication (⊙), the mechanism adaptively highlights channels critical for damage characterization while suppressing background noise.

To address sample scarcity (particularly in vibration data), a weighted decision layer is introduced to dynamically adjust decision boundaries. The final probability $[eqn]$ fuses a base classifier with an adjustment term:

[eqn]

[eqn]

[eqn]

where x is the high-level feature vector. The adjustment layer $[eqn]$ (initialized to zero) specifically learns compensatory weights for minority classes, ensuring unbiased optimization during training.

Crucially, the dual-stream architecture outputs instance-level probability vectors for the subsequent fusion stage. By processing synchronized single-frame inputs, the network yields Softmax outputs ( $[eqn]$ and $[eqn]$ ) representing class probabilities. These vectors, distinct from intermediate feature maps or temporal sequences, serve as the precise inputs for the Dempster–Shafer fusion module.

3.4. Multi-Modal Decision Fusion Based on D-S Evidence Theory

To address the varying reliability of vibration and infrared modalities, we employ Dempster–Shafer (D-S) evidence theory for decision-level fusion (Figure 5). This framework explicitly models epistemic uncertainty, allowing for robust conflict resolution between heterogeneous data sources.

The decision framework is defined over a set of mutually exclusive damage states $[eqn]$ . The core mechanism involves constructing Basic Probability Assignments (BPAs) from network Softmax outputs and fusing them via Dempster’s combination rule. The mathematical definition for combining two independent evidences ( $[eqn]$ ) is given by the orthogonal sum:

[eqn]

where $[eqn]$ represents the conflict coefficient. This rule is associative and can be applied iteratively to fuse multiple evidence sources from both modalities, as outlined in Algorithm 2. Algorithm 2 Multi-modal Fusion Decision Method Based on D-S Evidence TheoryRequire: Vibration Probabilities: $[eqn]$ for $[eqn]$ Infrared Probabilities: $[eqn]$ for $[eqn]$

Ensure: Final decision class $[eqn]$
1:Step 1: BPA Construction
2:Map probabilities to BPAs for all sensors/frames:
3: $[eqn]$
4: $[eqn]$
5:Step 2: Evidence Combination
6:Fuse all evidence bodies iteratively using Dempster’s rule (Equation (20)):
7: $[eqn]$
8:Step 3: Decision Making
9:Select class with maximum belief:
10: $[eqn]$
11:return $[eqn]$

4. Results and Discussion

4.1. Experimental Setup and Dataset Construction

4.1.1. Preparation and Characterization of HVI Damaged Specimens

This study focuses on five typical spacecraft damage types: cracks, craters, delamination, ablation, and perforation, unlike simulation-based studies, this research relies on high-fidelity physical data obtained from a professional two-stage light gas gun ballistic range. To accurately simulate the structural response of modern spacecraft thermal protection systems, the target specimens were manufactured from Carbon Fiber Reinforced Carbon (C/C) composites ( $[eqn]$ ). Spherical aluminum projectiles with a diameter of 3.0 mm were employed to impact these targets at velocities ranging from 1.5 km/s to 5.0 km/s in a vacuum environment.

High-velocity penetration generated craters and perforations, while shock wave propagation induced internal delamination and cracks. To ensure label reliability, a “physical measurement + expert verification” protocol was implemented. Samples were quantified via optical microscopy (e.g., typical craters: 10.3–10.7 mm diameter) and independently validated by two experts to establish the Ground Truth.

4.1.2. Multimodal Data Acquisition and Preprocessing Strategy

The system architecture and operational workflow of the experimental platform are illustrated in Figure 6. The infrared thermography system captures the specimen’s two-dimensional temperature distribution, yielding thermal images with a resolution of $[eqn]$ pixels. Concurrently, piezoelectric accelerometers record surface vibration signals at a sampling rate of 250 kHz. To ensure the validity of multimodal fusion, a strict sequential acquisition protocol was implemented. Since HVI damage constitutes a permanent plastic deformation with static stability, strict temporal synchronization is not required. By sequentially capturing thermal and vibration data from the same damaged specimen, physical state consistency is maintained, ensuring that both modalities characterize the identical damage event for decision-level fusion.

In the preprocessing stage, specific strategies were employed to enhance signal quality and standardize inputs. Regarding the sensitivity to non-stationary noise, although standard spectral subtraction assumes stationarity, our method remains robust due to the extremely short duration of HVI response. Within this micro-to-millisecond scale, the statistical properties of background noise (such as thermal noise and mechanical micro-vibrations) can be approximated as quasi-stationary. Consequently, the signal processing is conducted within STFT frames (256 points, approx. 1 ms), where noise properties remain stable. Furthermore, the subsequent ResNet backbone focuses on the topological structure of energy ridges rather than absolute pixel amplitudes, providing feature-level tolerance to residual noise. The signals were converted into time–frequency spectrograms using a Hanning window of length 256, a hop size of 64, and 256 FFT points, followed by a logarithmic amplitude scale and robust Min-Max normalization (based on 5th and 95th percentiles). Subsequently, both inputs were resized to $[eqn]$ pixels. This resolution was selected to balance accuracy and efficiency: (1) It adapts to constrained on-orbit computing resources by reducing FLOPs. (2) It preserves sufficient physical fidelity, specifically the low-frequency energy ridges in vibration spectra and thermal gradients in infrared images. (3) It ensures network compatibility, as the size is divisible by the ResNet-50 downsampling factor, maintaining spatial resolution in the final feature maps.

To validate the proposed method, a multi-modal dataset was constructed covering six states: Class 0 (Normal), Class 1 (Crack), Class 2 (Crater), Class 3 (Delamination), Class 4 (Ablation), and Class 5 (Perforation). As shown in Table 1, the dataset exhibits inherent class imbalance due to the physical rarity of high-velocity impact samples. To establish robustness under this imbalance, we introduced a class-weighted Cross-Entropy loss mechanism to assign higher gradient weights to rare samples. Simultaneously, strictly regularized optimization strategies, including the AdamW optimizer with weight decay and a Cosine Annealing learning rate schedule, were implemented to restrict the model from falling into sharp local minima and prevent overfitting. Crucially, to prevent data leakage, an event-based splitting strategy was implemented instead of random shuffling. Samples were partitioned based on the unique Experiment ID at a ratio of 7:2:1. This ensures that the test set consists of completely unseen physical events, effectively assessing the model’s true generalization capability.

4.1.3. Hyperparameter Settings and Evaluation Metrics

Experiments were conducted on a Windows 10 workstation equipped with an Intel i5-12400 CPU, 16 GB RAM, and an NVIDIA GTX 1650 SUPER GPU. All algorithms were implemented using Python 3.7 and PyTorch 1.7.1. To ensure reproducibility, the specific network training hyperparameters (including the AdamW optimizer and Cosine Annealing strategy) are summarized in the upper section of Table 2.

To comprehensively quantify the model performance, a multi-dimensional evaluation metric system was established, as mathematically defined in the lower section of Table 2. First, to evaluate the efficacy of the proposed vibration spectrogram preprocessing, we introduced a custom feature separability score ( $[eqn]$ ). This metric quantifies the statistical contrast between the signal energy ridges and the background noise, providing a direct measure of signal-to-noise ratio enhancement. Regarding classification performance, relying solely on Accuracy is insufficient due to the inherent class imbalance in HVI datasets. Therefore, we incorporated Precision, Recall, and F1-Score to analyze class-specific performance. More importantly, to assess global robustness, we employed Macro-F1 and Cohen’s Kappa coefficient. Macro-F1 calculates the unweighted average across all categories, ensuring that minority classes (e.g., Perforation) contribute equally to the final score. Cohen’s Kappa evaluates the agreement between predictions and ground truth after removing the possibility of chance agreement, thereby verifying that the high accuracy is not a result of statistical bias toward majority classes.

4.2. Quantitative Validation of Signal Reconstruction and Enhancement Modules

This section quantitatively verifies the independent contributions of the proposed preprocessing modules—Spectral Subtraction (SS) for vibration and Distance-Aware Augmentation (DA) for infrared—ensuring that the high performance of the fusion system stems from high-quality unimodal features rather than the masking effect of the decision mechanism.

4.2.1. Efficacy of Spectral Subtraction Denoising

To visualize the signal reconstruction capability, Figure 7 presents the frequency-domain analysis. Original signals (blue) are heavily contaminated by environmental noise in the 60–70 kHz band, masking the true damage signatures. The proposed spectral subtraction (orange) effectively suppresses this noise floor while preserving the discriminative spectral peaks at 30 kHz and 55 kHz, significantly improving the Signal-to-Noise Ratio (SNR).

The time-domain characteristics are detailed in Figure 8. Comparing the raw and denoised waveforms, it is evident that the stochastic background micro-vibrations are attenuated. For complex damage types like Crater and Ablation, the main impact pulse peaks (approx. 1.6 mV) become distinguishable from the clutter, verifying that the subtraction process does not erode the transient shock features.

Furthermore, Figure 9 contrasts the Time–Frequency (TF) spectrograms. The separability scores ( $[eqn]$ ) marked in red quantify the contrast improvement. For the Normal state, $[eqn]$ increases from 1.59 to 2.44, indicating a cleaner background. Crucially, for Perforation, the denoising reveals hidden high-frequency transient bursts that were previously submerged in noise, validating the necessity of SS for feature extraction.

4.2.2. Efficacy of Distance-Aware Infrared Augmentation

Figure 10 visually demonstrates the proposed Distance-Aware Augmentation (DA) strategy. To ensure reproducibility and address the domain shift caused by varying imaging distances, we established specific mathematical definitions for the transformation zones based on the scaling factor $[eqn]$ (where $[eqn]$ is the baseline). Specifically, the Near-field zone ( $[eqn]$ ), defined by the threshold $[eqn]$ , simulates close-range inspection through Scale Up and Crop operations to emphasize the high-frequency textures of micro-cracks. The Mid-field zone ( $[eqn]$ ), covering the range $[eqn]$ , represents standard monitoring conditions, utilizing Random Rotation ( $[eqn]$ ) and perspective distortion to model probe angle deviations. Finally, the Far-field zone ( $[eqn]$ ), where $[eqn]$ , simulates remote detection by applying Scale Down and Pad operations, coupled with Gaussian Blur (radius $[eqn]$ ) and Gaussian Noise ( $[eqn]$ ) to replicate atmospheric attenuation and low-SNR degradation.

4.2.3. Ablation Study Results

To rigorously quantify the contribution of these modules, independent ablation studies were conducted on single-modality networks, as detailed in Table 3. In the vibration modality, the baseline model using raw STFT signals achieved an accuracy of only 51.67% due to severe noise interference. However, the application of Spectral Subtraction (SS) yielded a substantial performance boost, increasing accuracy by 44.16% to reach 95.83%. This confirms that denoising is a prerequisite for effective vibration feature extraction. Similarly, for the infrared modality, the proposed Distance-Aware (DA) strategy was compared against standard geometric augmentation. The DA method improved accuracy from 87.78% to 98.33% (a gain of +10.55%). This significant improvement verifies that physically modeling domain shifts, such as blur and scale variations, enables the network to learn robust features invariant to imaging distances.

4.3. Performance Comparison of Classification Models and Robustness Verification

This section evaluates the classification performance, establishing unimodal baselines before demonstrating the advantages of multimodal fusion. The analysis proceeds from training convergence verification to a rigorous quantitative comparison against state-of-the-art benchmarks.

The training progress, illustrated in Figure 11, indicates that both vibration and infrared branches effectively learn discriminative features. The loss function decreases rapidly approaching zero, and accuracy stabilizes near the optimum after approximately 10 epochs. These synchronized convergence trends validate the independent efficacy of both vibration and infrared inputs for damage recognition tasks.

4.3.1. Comparative Assessment of Unimodal Baselines

To assess architectural superiority, we evaluated the proposed method against four benchmarks, with quantitative results detailed in Table 4. As indicated by the training dynamics (Figure 12), the proposed dual-stream backbone exhibits the fastest convergence and highest stability, effectively overcoming the instability observed in MobileNetV2 (vibration stream) and the fluctuations in lightweight models (infrared stream).

Quantitatively, the proposed method establishes state-of-the-art baselines in both modalities. In the vibration domain, it attains 95.83% accuracy and 94.61% Kappa, significantly outperforming MobileNetV2 and classical CNNs by maintaining a balanced Precision-Recall profile (∼95%). In the infrared domain, compared with other baseline methods, the proposed method maintains excellent classification results (98.33% accuracy, 98.34% F1-Score). These results confirm the backbone’s robustness in extracting discriminative features from both high-frequency impact signals and thermal images.

4.3.2. Architectural Optimization: Attention Mechanism Analysis

To validate the rationale behind inserting the Channel Attention Mechanism (CAM) at the shallow bottleneck (Stage 1), we conducted a comparative ablation study against variants with no attention (Baseline), high-level attention (Stage 4), and full-stage attention. As shown in Table 5, the proposed Stage-1 strategy achieves optimal results in both modalities. Notably, placing attention only at high levels (Stage 4) resulted in negligible improvement or even degradation. From a physical perspective, this confirms that HVI damage is characterized by low-level texture features (e.g., micro-cracks) rather than high-level semantics. Enhancing these bottom-level features is crucial for identifying micro-damage signals in imbalanced datasets, further supporting the architectural rationality against overfitting concerns.

To further elucidate the impact on minority classes, Table 6 details the per-class metrics for the vibration modality. The High-level model fails to capture weak features, resulting in a Recall of only 0.5469 for the Crack class. In contrast, the Proposed method significantly improves the Crack Recall to 0.7656 and achieves perfect recognition (Recall 1.0) for Delamination and Ablation. This confirms that enhancing low-level feature extraction is essential for identifying subtle damage signals in imbalanced datasets.

4.3.3. Generalization Assessment via Distance-Domain Cross-Validation

To rigorously assess the model’s robustness against domain shifts, we conducted a targeted generalization test. While vibration signals exhibit high consistency across specimens due to contact measurement, the infrared modality involves non-contact imaging where distance variations introduce severe geometric scaling, serving as a stringent proxy for out-of-distribution (OOD) generalization. We employed a “Leave-One-Distance-Out” protocol, partitioning the dataset into distinct Near, Mid, and Far-field domains.

The cross-validation results in Table 7 confirm the model’s robustness. In Scenarios 2 and 3, where the training set included distance anchors (e.g., Near & Far), the model successfully generalized to the unseen Mid-field domain, achieving 76.11% accuracy. This capability to perform well on physically distinct domains with different noise distributions proves that the model learns intrinsic damage representations invariant to scale and blur. Consequently, the high training accuracy represents excellent fitting to these physical laws rather than memorization of background noise.

4.4. Synergistic Efficacy of Multimodal Decision-Level Fusion

This section quantitatively validates the synergistic effect of the D-S fusion mechanism. By integrating complementary evidence from vibration and infrared sensors, the fusion framework aims to overcome the physical limitations inherent in single-modality perception.

4.4.1. Category-Wise Performance Improvement

The visualization results in Figure 13 and Figure 14 illustrate the class-specific performance gains. The confusion matrices (Figure 13) reveal that unimodal approaches suffer from distinct misclassifications due to spectral or physical limitations. specifically, the vibration modality (Figure 13a) exhibits reduced sensitivity to “Crack” categorie, achieving only 76.7% accuracy for both. This shortfall is likely due to the subtle structural changes induced by “Crack”, which generate weaker global vibration responses compared to more severe damages like perforation.

Conversely, the infrared modality (Figure 13b) demonstrates strong texture recognition capabilities but exhibits slight instability in identifying “Perforation” (96.7%). The proposed D-S fusion strategy (Figure 13c) effectively synthesizes these complementary inputs. It not only matches the superior infrared performance for “Crack” and “Ablation” (recovering to 98.3% and 100.0%) but also corrects the infrared instability in “Perforation” cases. Consequently, the fusion method maintains a robust accuracy range of 96.7–100.0% across all categories, this confirms that the evidence-theoretic fusion successfully mitigates unimodal blind spots, ensuring high-reliability diagnosis.

4.4.2. Comparative Analysis of Fusion Strategies

To demonstrate the superiority of the evidence-theoretic approach under bandwidth-constrained on-orbit conditions, we benchmarked the Proposed D-S method against three standard decision-level fusion strategies: Average Fusion, Max-Confidence Fusion, and Weighted Fusion. The comparative results are presented in Table 8.

In the Normal scenario (full test set), all fusion strategies achieve high accuracy (>99%), exhibiting a ceiling effect due to the strong baseline performance. However, distinct performance gaps emerge in the “High Conflict” scenario (where unimodal predictions disagree). While Average Fusion maintains high accuracy through smoothing, its Precision drops to 0.8750. In contrast, the Proposed D-S method achieves the highest Precision (0.9476) and Macro-F1 (0.9476). This indicates that the D-S evidence theory offers a more rigorous mechanism for handling conflicting sensor inputs, reducing the risk of false positives which is critical for autonomous decision-making in space.

4.4.3. Robustness and Stability Verification

To assess the statistical reliability of the proposed framework, we conducted repeated experiments using five distinct random seeds (42, 123, 888, 2026, 3407) to account for variations in data partitioning and weight initialization. Table 9 reports the mean and standard deviation for key performance metrics.

The statistical results confirm that the fusion framework significantly enhances both accuracy and stability. The fusion method achieves a mean accuracy of 99.01% with a standard deviation of only 0.74%, which is substantially lower than the volatility observed in the single vibration modality ( $[eqn]$ ). Furthermore, the consistent high performance across randomized data partitions rules out specific bias to a fixed training set, confirming that the model has learned generalized damage features rather than overfitting to specific samples.

5. Discussion

5.1. Physical Interpretation of Misclassification Mechanisms

The misclassification patterns observed in the confusion matrices stem from the inherent physical limitations of single sensing modalities and the strong physical co-occurrence of HVI damage. In the vibration modality, the misidentification of micro-cracks as normal states arises from the global nature of modal responses; micro-scale defects often fail to induce statistically significant global stiffness changes or frequency shifts, creating a detection blind spot. Similarly, the infrared modality is constrained by thermal diffusion effects, where ablation and delamination generate indistinguishable surface temperature gradients due to similar thermal resistance boundaries under transient heating.

These physical ambiguities confirm that relying on a single sensor inevitably leads to decision uncertainty. By leveraging Dempster–Shafer (D-S) evidence theory, the proposed framework effectively resolves these conflicts. Unlike traditional weighted averaging, which risks propagating errors in high-conflict scenarios, the D-S rule utilizes the “uncertainty” mass to discount conflicting evidence, thereby achieving superior reliability in distinguishing physically coupled damage types.

5.2. Framework Scalability and Material Universality

The validity of this study is underpinned by high-fidelity physical experiments rather than numerical simulations. From a mechanistic perspective, the proposed detection framework exhibits theoretical scalability across different material systems. In vibration analysis, damage-induced local mass loss and stiffness degradation inevitably lead to eigenfrequency shifts, a phenomenon governing both metallic and composite structures. Similarly, infrared detection relies on thermal wave scattering at defect interfaces, a fundamental heat transfer behavior applicable to various materials. Therefore, while signal amplitudes may vary, the underlying feature extraction and D-S fusion logic remain transferable. Future applications on other spacecraft materials would primarily require parameter fine-tuning rather than architectural reconstruction.

5.3. Limitations and Future Work

Despite the demonstrated efficacy of the proposed framework in identifying damage categories, this study focuses primarily on discrete classification and has not yet addressed the quantitative assessment of damage severity (e.g., hole size or crack depth). The current model provides a qualitative diagnosis suitable for triggering alarms but lacks the capability for precise lifespan prediction. Future work will extend this architecture from classification to regression tasks, aiming to establish a continuous mapping between multimodal features and quantitative damage metrics, thereby providing a more comprehensive solution for spacecraft structural health monitoring.

6. Conclusions

This study proposes a physics-informed multimodal fusion framework for reliable spacecraft HVI damage monitoring. By integrating vibration spectral subtraction and distance-aware infrared enhancement within a dual-stream ResNet architecture, the method overcomes unimodal physical limitations such as stiffness insensitivity and thermal ambiguity. A core innovation is the Dempster–Shafer (D-S) evidence-theoretic mechanism, which resolves inter-modal conflicts at the decision level.

Experiments on multi-class HVI specimens demonstrate that the fusion strategy achieves a mean accuracy of 99.01%, significantly outperforming unimodal baselines (92.96% and 97.11%). The D-S mechanism effectively corrects conflicting evidence, achieving perfect precision in four categories. Framework stability is confirmed by a low standard deviation of 0.74% in 5-fold random seed tests, while “Leave-One-Distance-Out” cross-validation proves robust generalization across imaging distances. Furthermore, ablation studies verify that the shallow-layer Channel Attention Mechanism is critical for capturing micro-scale damage features.

In summary, this work establishes a high-precision baseline for on-orbit diagnosis. Future research will extend the framework from discrete classification to continuous regression for quantitative damage assessment and further optimize the model for satellite edge computing deployment.

Bibliography34

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Wen K. Chen X.W. Lu Y.G. Research and development on hypervelocity impact protection using Whipple shield: An overview Def. Technol.2021171864188610.1016/j.dt.2020.11.005 · doi ↗
2Huang X.G. Yin C. Ru H.Q. Zhao S.M. Deng Y.J. Guo Y.J. Liu S. Hypervelocity impact damage behavior of B 4C/Al composite for MMOD shielding application Mater. Des.202018610832310.1016/j.matdes.2019.108323 · doi ↗
3Yu S. Fan C. Zhao Y. Hypervelocity impact detection and location for stiffened structures using a probabilistic hyperbola method Sensors 202222300310.3390/s 2208300335458988 PMC 9031354 · doi ↗ · pubmed ↗
4Wu C.Y. He Q.G. Chen X.W. Zhang C.B. Shen Z.B. Debris cloud structure and hazardous fragments distribution under hypervelocity yaw impact Def. Technol.20232716918310.1016/j.dt.2022.09.010 · doi ↗
5Shang C. Ren T.F. Zhang Q.M. Lu Y.Y. Long R.R. Guo X.H. Hu X. Experimental research on damage characteristics of multi-spaced plates with long rods of steel and W-Zr reactive material at hypervelocity impact Mater. Des.202221611056410.1016/j.matdes.2022.110564 · doi ↗
6He Q.G. Chen X.W. Simulation method of debris cloud from fiber-reinforced composite shield under hypervelocity impact Acta Astronaut.202320440241710.1016/j.actaastro.2023.01.008 · doi ↗
7Wu C.Y. Chen X.W. He Q.G. Study on damage mechanism and damage distribution of the rear plate under impact of debris cloud Def. Technol.20243515116710.1016/j.dt.2024.01.005 · doi ↗
8Chen Y. Tang Q.Y. He Q.G. Chen L.T. Chen X.W. Review on hypervelocity impact of advanced space debris protection shields Thin-Walled Struct.202420011187410.1016/j.tws.2024.111874 · doi ↗