# A Welding Defect Detection Model Based on Hybrid-Enhanced Multi-Granularity Spatiotemporal Representation Learning

**Authors:** Chenbo Shi, Shaojia Yan, Lei Wang, Changsheng Zhu, Yue Yu, Xiangteng Zang, Aiping Liu, Chun Zhang, Xiaobing Feng

PMC · DOI: 10.3390/s25154656 · Sensors (Basel, Switzerland) · 2025-07-27

## TL;DR

This paper introduces a welding defect detection model that combines deep learning and handcrafted features to improve accuracy and interpretability in real-time welding quality monitoring.

## Contribution

A novel hybrid-enhanced multi-granularity spatiotemporal representation learning algorithm is proposed to address interference and interpretability issues in welding defect detection.

## Key findings

- The model achieves 99.187% accuracy on a self-constructed welding dataset.
- Inference takes 20.983 ms per sample on a platform with an Intel i9-12900H CPU and an RTX 3060 GPU.
- The hybrid approach effectively balances accuracy, speed, and interpretability in complex welding scenarios.
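
To put the reported latency in perspective, it can be converted into an approximate throughput. This is only a back-of-the-envelope figure: it assumes samples are processed one after another with no batching or pipeline overhead, and this summary does not state whether a "sample" is a single frame or a multi-frame clip.

$$
\text{throughput} \approx \frac{1000\ \text{ms/s}}{20.983\ \text{ms/sample}} \approx 47.7\ \text{samples/s}
$$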

## Abstract

Real-time quality monitoring using molten pool images is a critical focus of research on high-quality, intelligent automated welding. To address interference in molten pool images under complex welding scenarios (e.g., reflected laser spots from spatter misclassified as porosity defects) and the limited interpretability of deep learning models, this paper proposes a multi-granularity spatiotemporal representation learning algorithm based on the hybrid enhancement of handcrafted and deep learning features. A MobileNetV2 backbone network integrated with a Temporal Shift Module (TSM) is designed to progressively capture the short-term dynamic features of the molten pool and integrate temporal information across both low-level and high-level features. A multi-granularity attention-based feature aggregation module is developed to select key interference-free frames using cross-frame attention, generate multi-granularity features via grouped pooling, and apply the Convolutional Block Attention Module (CBAM) at each granularity level. Finally, these multi-granularity spatiotemporal features are adaptively fused. Meanwhile, an independent branch uses Histogram of Oriented Gradients (HOG) and Scale-Invariant Feature Transform (SIFT) features to extract long-term spatial structural information from historical edge images, enhancing the model’s interpretability. The proposed method achieves an accuracy of 99.187% on a self-constructed dataset. It also attains an inference time of 20.983 ms per sample on a hardware platform equipped with an Intel i9-12900H CPU and an RTX 3060 GPU, thus balancing accuracy, speed, and interpretability.
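
The temporal shift idea mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `temporal_shift`, the `(T, C, H, W)` layout, and the shift ratio `fold_div=8` (a commonly used TSM setting) are assumptions made for illustration, and this summary does not say where in MobileNetV2 the shift is applied. The sketch only shows the core operation: a fraction of the channels is moved one frame forward or backward in time, so that a following 2D convolution mixes information from neighbouring frames at no extra parameter cost.

```python
import numpy as np


def temporal_shift(x: np.ndarray, fold_div: int = 8) -> np.ndarray:
    """Shift a fraction of channels along the time axis (illustrative sketch).

    x        : feature maps for one clip, shape (T, C, H, W)  [assumed layout]
    fold_div : 1/fold_div of the channels is shifted in each direction
               (assumed value, not taken from the paper summary)
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)

    # Channels [0, fold): each frame receives features from the NEXT frame.
    out[:-1, :fold] = x[1:, :fold]
    # Channels [fold, 2*fold): each frame receives features from the PREVIOUS frame.
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]
    # Remaining channels are left untouched.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


if __name__ == "__main__":
    # Toy clip: 4 frames, 8 channels, 2x2 spatial size; frame t is filled with t + 1.
    clip = np.stack([np.full((8, 2, 2), t + 1.0) for t in range(4)])
    shifted = temporal_shift(clip)
    print(shifted[:, 0, 0, 0])  # channel 0 now holds values from the next frame
    print(shifted[:, 1, 0, 0])  # channel 1 now holds values from the previous frame
```

Positions with no neighbouring frame (the last frame for the forward-looking channels, the first frame for the backward-looking ones) are zero-padded in this sketch; other boundary treatments are possible.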

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12349105/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12349105/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/PMC12349105/full.md

---
Source: https://tomesphere.com/paper/PMC12349105