DAS-YOLOv13: Dual-Axis Attention and Feature Fusion Model for Wafer Surface Defect Detection

Jingzhe Zhang; Rui Sun; Bo Li; Dexin Kong; Dejin Zhao; Jianhai Zhang

PMC · DOI: 10.3390/s26051574 · March 2, 2026


Open Access

TL;DR

This paper introduces DAS-YOLOv13, a new model for detecting tiny defects on wafers in semiconductor manufacturing, improving detection accuracy and reliability.

Contribution

The novel DAS-YOLOv13 model integrates dual-axis attention, adaptive multi-scale representation, and self-modulation feature aggregation for enhanced wafer defect detection.

Findings

01

DAS-YOLOv13 achieves a mean Average Precision (mAP) of 74.2% on the wafer defect dataset.

02

The model improves detection accuracy for tiny and multi-scale defects by 4.3% compared to YOLOv13n.

03

It reaches 92.9% mean Average Precision at an Intersection over Union (IoU) threshold of 50% (mAP50).

Abstract

What are the main findings? This paper proposes a dual-axis attention-enhanced YOLOv13 framework to suppress lithographic textures and enhance direction-sensitive tiny wafer defect features. This paper introduces adaptive dynamic multi-scale representation and self-modulation feature aggregation to improve cross-scale feature alignment and fine-grained defect representation. What are the implications of the main findings? More accurate and stable detection of tiny and multi-scale wafer surface defects can be achieved, even under complex…

Funding

The Science and Technology Department of Jilin Province

Keywords

wafer; defect detection; DAS-YOLOv13; dual-axis attention module

Taxonomy

Topics: Industrial Vision Systems and Defect Detection · Advanced Neural Network Applications · Advanced Data and IoT Technologies

Full text

1. Introduction

Semiconductor manufacturing technology is continuously evolving toward higher-precision processes and finer circuit structures. As the core substrate for chip fabrication, wafers need to accommodate exponentially increasing numbers of transistor arrays and interconnection lines via processes such as photolithography and etching. This, in turn, leads to a marked rise in the pattern complexity and structural integration of the wafer surface [1]. Consequently, the surface quality of wafers directly impacts chip reliability and the final product value. To ensure high-quality surfaces with complex shapes and structures, it is imperative to maintain surface integrity and eliminate defects. However, due to the limitations of traditional production processes and variations in operator skills, common surface defects such as scratches, edge bites, stains embedded, gray lines, and short circuits frequently occur during manufacturing. Comprehensive quality inspection of wafer surfaces is therefore crucial for the timely identification of non-conforming products.

Nevertheless, manual inspection remains the core defect identification method for most semiconductor manufacturers. This approach is not only time-consuming and cumbersome but also prone to deviations and inconsistent results caused by operator visual fatigue [2]. Furthermore, the processing speed of manual inspection can no longer meet the requirements for efficient detection in modern industrial production. Thus, developing defect detection algorithms to realize automatic surface defect recognition is a key initiative to promote industrial automation upgrading and ensure product quality [3].

Traditional image detection algorithms primarily rely on manually designed defect feature extraction and classification systems. Piao et al. [4] used Radon transforms to extract geometric features and decision tree ensembles for recognition. Cai et al. [5] achieved scratch localization via optical edge enhancement and morphological operations. Kim et al. [6] combined color information and Raman spectroscopy for SiC wafer analysis. He et al. [7] optimized spectral clustering for large-scale data. However, these methods largely depend on prior information, and their generalization capabilities are limited by the insufficient effectiveness of feature extraction modules, making them difficult to adapt to complex and dynamic industrial scenarios.

The rise of deep learning technology has provided a new pathway for defect detection. Its end-to-end learning capability can automatically mine deep feature correlations between defects and backgrounds, enabling fast and high-precision defect detection that meets the robustness and efficiency requirements of complex industrial environments [8]. Currently, mainstream object detection models are categorized into two-stage and one-stage architectures. Two-stage models such as Cascade R-CNN [9] and Mask R-CNN [10] generate candidate regions followed by fine-grained classification and localization, achieving high precision in wafer defect detection. However, their two-stage process results in slow inference, making it difficult to keep pace with the transmission speed of semiconductor production lines. One-stage models such as FCOS [11], CenterNet [12], and the YOLO series [13] perform classification and localization directly on features, which reduces detection latency. By balancing precision and efficiency, they offer superior real-time detection performance.

In industrial defect detection tasks, modifying the architecture of YOLO models has emerged as a pivotal strategy for enhancing model performance. For instance, Yu et al. [14] strengthened multi-scale feature fusion by redesigning the neck module of YOLOv5, achieving robust performance in corrosion defect detection; Deng et al. [15] integrated deformable convolution and enhanced attention mechanisms into YOLOv8s, yielding remarkable improvements in small-object detection for UAV-based inspection scenarios; Zhang et al. [16] optimized the feature representation capability of YOLO11 via spatial-channel reconstruction convolution, thereby boosting detection accuracy in maritime search and rescue missions; Cheng et al. [17] introduced a dynamic detection head into YOLO11, enabling the model to maintain high detection precision while achieving model lightweighting. Collectively, these studies demonstrate that YOLO-series models possess excellent adaptability and considerable scalability for diverse industrial defect detection tasks.

In the wafer and semiconductor domain, deep learning has similarly demonstrated remarkable application potential. Chen et al. [18] proposed ESPP-Net, which integrates convolutional networks, SE modules, and attention-guided spatial pyramid pooling to achieve high-precision recognition of both single-type and mixed-type wafer map defects. Zhang et al. [19], building upon YOLOv8, enhanced the reliability of wafer defect detection in the ion implantation process by optimizing positive-negative sample assignment and introducing an improved global attention mechanism. Huang et al. [20] developed FMKA-Net, establishing an interaction channel for shallow and deep features via 2D discrete wavelet transform, attention fusion modules, and spatial pyramid pooling—this network exhibits stronger robustness in wafer images with significant noise interference. While all the aforementioned methods have achieved promising results in their respective scenarios, they primarily focus on a single aspect, such as local feature enhancement, attention modeling, or noise suppression.

As the latest iteration, YOLOv13 [21] boasts distinct speed advantages owing to its design of direct bounding box and category regression. By optimizing network depth and width as well as refining label assignment strategies, it strikes a balance between accuracy and speed in industrial small-object detection tasks, thus emerging as a preferred model for industrial deployment. Nevertheless, direct application of YOLOv13 to wafer defect detection faces notable challenges: wafer defects occupy an extremely small proportion of the image, making tiny defect features hard to capture; in deep networks, defect information is overwhelmed by the strong texture features of lithographic patterns [22]; additionally, the dynamic coexistence of multi-scale defects conflicts with the model’s insufficient adaptability. Although the Feature Pyramid Network (FPN) [23] of YOLOv13 enables multi-scale feature fusion, its adaptability to extreme scale differences is limited. To address the aforementioned issues, we propose the DAS-YOLOv13 model, specifically tailored for small-object defect detection in wafers. By optimizing the architecture of YOLOv13, DAS-YOLOv13 enhances detection accuracy, making it well suited for deployment in industrial defect detection environments.

The main contributions of this study are as follows:

(1) To enhance the extraction and integration of key information, we propose a Dual-Axis Attention (DAA) module. This module effectively extracts features from different samples by strengthening key features and suppressing redundancy.
(2) We design an Adaptive Dynamic Multi-scale Representation (ADMR) module to enhance the model’s ability to capture multi-scale features, enabling the model to focus on key wafer defect features and improve feature extraction effectiveness.
(3) We introduce a Self-Modulation Feature Aggregation (SMFA) module, which accurately refines fine-grained defect features through deterministic spatial mixing attention and learnable downsampling.

The subsequent structure of this paper is organized as follows: Section 2 briefly outlines the YOLOv13 baseline framework, focusing on model design and network structure optimization ideas; Section 3 describes the creation of the custom Wafer Defect Dataset (WDD) for automated inspection scenarios; Section 4 presents experimental results, including comparative experiments with mainstream object detection algorithms and module ablation experiments to systematically verify the effectiveness of the proposed model; Section 5 summarizes the experimental results, condenses the core research contributions, and outlines future directions.

2. Materials and Methods

2.1. YOLOv13

Leveraging the efficiency and precision of one-stage detectors, the YOLO series has sustained a leading stance in real-time visual tasks. As shown in Figure 1, YOLOv13-N, adopted as the base model in this study, consists of backbone, neck, and head modules in its architecture.

YOLOv13 introduces two core innovations: Hyper-graph Adaptive Correlation Enhancement (HyperACE) for high-order semantic feature fusion and Full-Path Aggregation and Distribution (FullPAD) for cross-module information integration. Complemented by depthwise separable convolutions, the model achieves a balanced precision-efficiency tradeoff through parameter and computational cost reduction. The YOLOv13 team has released four scale-varied pre-trained models (Nano/Small/Large/Extra-Large), with network configuration differences enabling adaptation to diverse deployment scenarios.

Despite excelling in real-time performance and cross-scale correlation modeling, YOLOv13 encounters challenges with extremely small objects and strong texture backgrounds (e.g., wafer lithographic patterns), characterized by information overwhelming and response dilution. To mitigate these limitations, this study proposes lightweight, target-aware feature enhancement modules within the YOLOv13 framework, aiming to improve the discrimination of tiny weak-texture defects without substantial computational overhead.

2.2. Improvements to YOLOv13

To improve the precision of wafer defect detection, this study makes targeted improvements to the backbone and neck networks of YOLOv13, yielding the DAS-YOLOv13 model shown in Figure 2.

Specifically, a Dual-Axis Attention (DAA) module is embedded after the deep A2C2f layer of the backbone network. Through row and column feature enhancement and redundancy suppression, it improves the discriminability of defect information in complex backgrounds. In addition, at the multi-scale feature fusion node of the neck network, we propose an Adaptive Dynamic Multi-scale Representation (ADMR) module. Through a dynamic branch adaptation mechanism, it captures defect features at different scales while limiting the additional parameters and computational cost. Although these two modules capture global and multi-scale features, they do not sufficiently extract local defect details. To address this, a Self-Modulation Feature Aggregation (SMFA) [24] module is introduced.

2.2.1. Core Innovation I: DAA Module

A core challenge in wafer surface defect detection is that tiny defect features are often mixed with strong lithographic texture backgrounds, with blurred boundaries that are difficult to distinguish, making it challenging for the model to accurately capture target features. Feature enhancement is the key to overcoming this dilemma. While self-attention can model long-range dependencies to enhance features, the computational complexity of standard self-attention grows with the feature map size as O(N²) [25], where O(·) describes the growth trend of computational cost and N denotes the number of key feature points extracted from the industrial defect image. The overall computational load is therefore proportional to the square of the number of feature points, which makes standard self-attention too slow for high-resolution wafer detection under the real-time requirements of industrial applications and difficult to apply directly. The axial attention proposed by Ho et al. [26] reduces complexity by decomposing the attention calculation along the individual row and column dimensions of the tensor, providing key insights for lightweight attention modeling.
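To make the complexity gap concrete, the short sketch below counts pairwise comparisons for full self-attention and for row-plus-column axial attention. It assumes N = H × W spatial positions and that axial attention compares each position only with the W positions in its row and the H positions in its column; the function names and the 80 × 80 map size are illustrative choices, not values from the paper.

```python
def full_attention_pairs(height, width):
    """Pairwise comparisons when every position attends to every position."""
    n = height * width
    return n * n


def axial_attention_pairs(height, width):
    """Pairwise comparisons when each position attends along its row and its column."""
    n = height * width
    return n * (height + width)


# Illustrative feature map size (assumed, not taken from the paper).
full = full_attention_pairs(80, 80)
axial = axial_attention_pairs(80, 80)
print(full, axial, full // axial)  # 40960000 1024000 40
```

Under these assumptions the axial decomposition needs 40 times fewer comparisons on an 80 × 80 map, and the gap widens as the resolution grows.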

Accordingly, this study proposes a Dual-Axis Attention (DAA) module tailored for wafer detection needs: First, channel features are mapped to a unified space, and bidirectional structural features are synchronously captured through parallel row-column attention. Then, the dual-axis outputs are integrated with the global context. Under a lightweight framework, axial multi-scale features are efficiently fused, solving the efficiency problem of standard self-attention while enhancing the capture of direction-sensitive features of tiny defects, laying a foundation for subsequent accurate detection.

As shown in Figure 3, the DAA module consists of a dynamic weight generation component and an axial dependency modeling component. In the dynamic weight generation component, a lightweight controller guided by global context adaptively predicts channel attention in the row and column directions. Meanwhile, in the axial dependency modeling component, weighted features are aggregated along the row and column axes and coordinately fused, thereby achieving stable and direction-sensitive feature enhancement through residual paths.

For the input feature map $[eqn]$, the DAA module first extracts global context through global average pooling:

[eqn]

where $[eqn]$ is the global statistical vector. Subsequently, through a lightweight controller:

[eqn]

where the channel dimension is compressed to $[eqn]$ to obtain the control feature $[eqn]$ .

The control feature is then fed into two independent 1 × 1 convolution layers to generate row-related and column-related channel attention scores, respectively. To model intra-channel competition, the attention scores are normalized by a softmax operation along the channel dimension, producing the final channel attention weights Sr and Sc. Based on the learned channel attention, row-aware and column-aware feature responses are aggregated along the height and width axes, respectively:

[eqn]

[eqn]

where ⊙ denotes element-wise multiplication, $[eqn]$ denotes the row-wise feature response emphasizing horizontal structural information, and $[eqn]$ denotes the column-wise feature response highlighting vertical structural characteristics, forming direction-aware feature representations for subsequent fusion.

The two components constitute the structural responses of the input features in the vertical and horizontal directions, respectively. These directional responses are then combined through element-wise addition, normalized by a sigmoid function, and used to modulate the original input feature via element-wise multiplication. A residual connection is added to preserve the original feature information and stabilize the optimization, yielding the final direction-enhanced output:

[eqn]

where $[eqn]$ is a softmax function used to fit the linear transformation of global features, $[eqn]$ and $[eqn]$ are learnable weights and $[eqn]$ is the final output feature map.

In summary, through global context-driven dynamic attention generation, directionally decomposed axial dependency modeling, and a channel competition mechanism, the DAA module effectively improves the model’s ability to capture direction-sensitive structural information. Its selective weight adjustment and lightweight aggregation design enable the module to maintain stable attention responses in scenarios with complex textures and variable scales, thereby achieving more efficient feature representation at a lower computational cost.
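The data flow described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the controller uses a ReLU, the row and column responses are obtained by mean pooling along the width and height axes, the 1 × 1 convolutions on the pooled vector are written as matrix products, and all function and weight names (`dual_axis_attention`, `w_ctrl`, `w_row`, `w_col`) are hypothetical.

```python
import numpy as np


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def dual_axis_attention(x, w_ctrl, w_row, w_col):
    """x: (C, H, W); w_ctrl: (C // r, C); w_row, w_col: (C, C // r)."""
    g = x.mean(axis=(1, 2))              # global average pooling -> (C,)
    z = np.maximum(w_ctrl @ g, 0.0)      # lightweight controller (ReLU assumed)
    s_r = softmax(w_row @ z)             # row-related channel weights, sum to 1
    s_c = softmax(w_col @ z)             # column-related channel weights, sum to 1
    # Assumed aggregation: weight the channels, then average along one axis.
    f_r = (s_r[:, None, None] * x).mean(axis=2, keepdims=True)  # (C, H, 1)
    f_c = (s_c[:, None, None] * x).mean(axis=1, keepdims=True)  # (C, 1, W)
    gate = sigmoid(f_r + f_c)            # broadcasts to (C, H, W)
    return x * gate + x                  # modulation plus residual connection


rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 5, 4
x = rng.random((C, H, W)) + 0.1          # strictly positive toy feature map
y = dual_axis_attention(
    x,
    rng.normal(size=(C // r, C)),
    rng.normal(size=(C, C // r)),
    rng.normal(size=(C, C // r)),
)
print(y.shape)
```

Because the sigmoid gate lies between 0 and 1, each output value of this sketch stays between the input value and twice the input value for positive inputs, which is one way to see how the residual path keeps the enhancement bounded.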

As shown in Figure 4, this study visualizes the impact of DAA on defect extraction. The redder the color, the more attention the model pays to the region; the bluer the color, the less attention. Without DAA, the defect regions in the feature heatmap show scattered, weak responses, and background redundancy is poorly separated from target features. After introducing DAA, the red highlighted regions become larger and more concentrated, the activation intensity and coverage on defect regions increase markedly, and the responses of irrelevant background regions are suppressed. This indicates that the DAA module enhances key defect features and suppresses redundant background information, improving the discriminability and task adaptability of the feature representation. However, it does not fully solve the problem of defect feature capture. To this end, we introduce an adaptive dynamic multi-scale representation module for more effective recognition.

2.2.2. Core Innovation II: ADMR Module

The coexistence of multi-scale defects and the imbalance of cross-scale feature alignment are key bottlenecks in wafer detection. The fixed-weight fusion strategy of YOLOv13 cannot dynamically adapt, directly leading to feature conflicts and reduced localization precision. We introduce an Adaptive Dynamic Multi-scale Representation (ADMR) module in the neck network, integrating multi-scale convolution structures [27], feature calibration, and attention gating mechanisms [28] to achieve unified alignment, redundancy suppression, and fine-grained weight control of cross-scale features. This addresses the scale bias problem of traditional fusion, enhances the capture ability of extreme-scale defects, and supports the accurate detection of multi-scale defects.

As shown in Figure 5, the ADMR module calculates the weights w required for different-sized convolution kernels based on the input and selects the most appropriate convolution kernel for comprehensive image information extraction.

In this study, the kernel sizes kᵢ are set to 1, 3, and 5. As illustrated, the input tensor x first enters three convolution branches with different receptive fields: 1 × 1 for capturing local linear mappings, 3 × 3 for modeling medium-scale structures, and 5 × 5 for perceiving wider-range context. Because convolution kernels of different sizes produce outputs with inherently different statistics, direct fusion often causes feature scale conflicts. Therefore, after convolution, each branch is first normalized by a feature calibration (FC) module. The feature distributions of the branches are aligned by standardizing with global statistics (mean and variance) and then fine-tuned through learnable scale parameters α and offsets β, ensuring that the branch representations lie in a unified statistical domain and that branch instability does not affect subsequent fusion.

[eqn]

where μ(·) and σ²(·) are the global mean and variance, respectively, and α and β are learnable calibration parameters. After completing statistical alignment, each branch enters an attention gating module. This structure generates channel-wise gating weights through 1 × 1 convolution and Sigmoid activation to emphasize key responses and suppress redundant background features.
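A minimal NumPy sketch of the calibration and gating steps for a single branch is given here. It rests on several assumptions that the text does not settle: the mean and variance are taken over the whole branch tensor, α and β are scalars, a small constant `eps` is added for numerical stability, and the gate is computed per channel and per position. The names `feature_calibration`, `attention_gate`, and `w_gate` are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def feature_calibration(y, alpha=1.0, beta=0.0, eps=1e-5):
    """Standardize a branch with its global mean and variance, then rescale and shift."""
    mu = y.mean()
    var = y.var()
    return alpha * (y - mu) / np.sqrt(var + eps) + beta


def attention_gate(y, w_gate):
    """y: (C, H, W); w_gate: (C, C) weights of a 1x1 convolution."""
    scores = np.einsum("oc,chw->ohw", w_gate, y)  # 1x1 conv as a channel mix
    return y * sigmoid(scores)                    # gate values lie in (0, 1)


rng = np.random.default_rng(0)
branch = 5.0 + 3.0 * rng.normal(size=(4, 8, 8))  # toy branch with shifted statistics
calibrated = feature_calibration(branch)
gated = attention_gate(calibrated, rng.normal(size=(4, 4)))
print(round(float(calibrated.mean()), 6), round(float(calibrated.var()), 3))
```

With α = 1 and β = 0 the calibrated branch has approximately zero mean and unit variance, which is what places branches with different kernel sizes in a common statistical range before gating.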

This lightweight channel selection mechanism can effectively filter the inherent noise amplification effect of convolution, enabling multi-scale features to have higher discriminability before fusion. Meanwhile, the ADMR module also introduces a dynamic branch weighting mechanism. It generates a weight vector by performing adaptive pooling on the input, and then combines learnable branch biases to achieve soft-selective importance assignment, allowing the model to select a more appropriate receptive field according to the content of the image region. For example, in texture-rich regions, the model will dynamically increase the weight of the 3 × 3 branch, while in structurally smooth regions or regions with high context requirements, it tends to favor the 5 × 5 branch.

[eqn]

where y1, y2, y3 denote the outputs of three convolutional branches with different receptive fields after statistical alignment and attention gating, and w1, w2, w3 represent the corresponding dynamically learned branch-wise importance weights. Concat(·) indicates channel-wise concatenation, while Conv(·) denotes a 1 × 1 convolution used for cross-scale feature interaction and channel compression. x is the input feature introduced through a residual connection.

Finally, the ADMR module aggregates the three groups of calibrated and gated features and completes cross-scale mixing through 1 × 1 convolution to form a consistent and robust multi-scale representation. Residual connections further ensure structural stability and improve optimization efficiency.
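The dynamic weighting and fusion step can be sketched as follows. This is an illustration under assumptions: the three branch outputs are taken as given arrays (already calibrated and gated), the adaptive pooling is a global average that yields one weight vector per input, a softmax implements the soft selection, and the names `branch_weights`, `admr_fuse`, `w_dyn`, `bias`, and `w_mix` are hypothetical.

```python
import numpy as np


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


def branch_weights(x, w_dyn, bias):
    """x: (C, H, W); w_dyn: (3, C); bias: (3,). Returns weights w1, w2, w3."""
    pooled = x.mean(axis=(1, 2))            # global average pooling -> (C,)
    return softmax(w_dyn @ pooled + bias)   # soft selection over the three branches


def admr_fuse(x, branches, w_dyn, bias, w_mix):
    """branches: three (C, H, W) arrays; w_mix: (C, 3C) weights of a 1x1 convolution."""
    w = branch_weights(x, w_dyn, bias)
    stacked = np.concatenate([wi * yi for wi, yi in zip(w, branches)], axis=0)
    mixed = np.einsum("oc,chw->ohw", w_mix, stacked)  # cross-scale mixing, back to C channels
    return mixed + x                                   # residual connection


rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
x = rng.normal(size=(C, H, W))
branches = [rng.normal(size=(C, H, W)) for _ in range(3)]
w = branch_weights(x, rng.normal(size=(3, C)), np.zeros(3))
out = admr_fuse(x, branches, rng.normal(size=(3, C)), np.zeros(3), rng.normal(size=(C, 3 * C)))
print(w.round(3), out.shape)
```

The learnable bias lets a branch keep a baseline importance even when the pooled input gives it a low score, while the residual term means the module returns the input unchanged if the mixing weights are zero.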

To verify the effect of the ADMR module, this study conducts a visualization analysis through heatmaps. As shown in Figure 6, without ADMR, the defect feature responses are scattered and severely disturbed by background redundancy. After integrating ADMR, the feature attention (red highlighting) on the defect regions is markedly more concentrated and stronger, and the irrelevant background responses (blue) are suppressed. This indicates that, through feature calibration, attention gating, and dynamic multi-scale fusion, the ADMR module unifies the multi-branch feature distributions, adaptively enhances defect features, and suppresses background redundancy. These effects improve the discriminability and task adaptability of the feature representation, enhance defect localization precision and feature capture ability, and provide a more robust feature representation for defect detection in complex scenarios.

2.2.3. Core Innovation III: SMFA Module

Although the DAA module alleviates the background-defect separation problem, wafer detection still has shortcomings: The local fine-grained features of tiny defects are prone to loss, and the inter-channel correlations are not explored, resulting in insufficient recognition precision of low-resolution defects. To address this, the SMFA module is introduced in the neck.

As shown in Figure 7, within the SMFA module, an Efficient Approximation of Self-Attention (EASA) branch is implemented to explore non-local contextual dependencies at a moderate computational cost, enabling the model to capture long-range structural information across the feature map. In parallel, a Local Detail Estimation (LDE) branch is employed to focus on fine-grained spatial details, which are crucial for accurately representing tiny and weak-texture wafer defects. By jointly modeling global context and local details, the SMFA module achieves complementary feature enhancement, thereby improving the robustness and discriminability of the fused feature representations. Furthermore, the parallel branch design allows non-local and local features to be learned simultaneously without introducing excessive computational overhead. This balanced feature aggregation strategy effectively alleviates the loss of fine details in deep feature maps and supports more precise defect localization in complex wafer inspection scenarios.

For a given input feature $[eqn]$ (where C represents the number of channels and H × W indicates the spatial size of the feature map), first, 1 × 1 convolution is applied to the normalized x to expand the channels, and then, the channels are split into two parts as inputs to the two branches:

[eqn]

where ‖·‖₂ denotes normalization, Conv1×1(·) denotes 1 × 1 convolution, and Split(·) represents the channel splitting operation.

The core function of the EASA branch is to explore global dependency relationships, and its implementation process is detailed as follows: First, max-pooling with a fixed kernel size is applied to x1 for downsampling, which extracts low-frequency components of the feature while reducing computational complexity. Then, a 3 × 3 depth-wise convolution is used to encode the global structural information from the downsampled features, generating the intermediate feature xₛ. This process is expressed as:

[eqn]

where m(·) denotes max pooling and DWConv3×3(·) is a 3 × 3 depth-wise convolutional layer.

To enhance the discriminative power of the features, we introduce the variance information of the original input feature x for modulation. The variance of feature x, denoted as σ²(x), is defined based on all pixel values of x:

[eqn]

where $[eqn]$ is the variance of x, N is the total number of pixels, xᵢ denotes the value of each pixel, and $[eqn]$ is the mean of all pixel values.
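For reference, a variance consistent with these definitions can be written as below. This assumes the population form (division by N rather than N − 1), which the text does not state explicitly.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma^2(x) = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \mu\right)^2
```

As a small worked case, pixel values 1, 2, 3, 4 give μ = 2.5 and σ²(x) = (2.25 + 0.25 + 0.25 + 2.25) / 4 = 1.25.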

Next, we fuse the variance information σ²(x) with the intermediate feature xₛ: first, xₛ is processed by a 1 × 1 convolution to adjust the channel dimension, then added to σ²(x), and activated by the GELU [29] function. The activated feature is then upsampled to the original spatial size (H × W) using nearest-neighbor upsampling. Finally, the upsampled feature is multiplied element-wise with the original input x to generate the non-local feature xₗ. The entire integration process is:

[eqn]

where $[eqn]$ refers to the GELU activation function, $[eqn]$ (·) denotes a nearest upsampling operation, and ⊙ represents the element-wise product operation.

Since the EASA branch focuses on exploring non-local structural information, it may insufficiently capture local details, which are crucial for high-quality high-frequency feature reconstruction. To address this issue, we design the LDE branch with a lightweight structure to capture local detail features. Specifically, x₂ is first processed by an expanded 3 × 3 depth-wise convolution to encode local spatial information, generating the preliminary local feature yₕ. Then, yₕ is activated by GELU and processed by a 1 × 1 convolution to enhance feature representation, resulting in the enhanced local feature y_d. This process is:

[eqn]

After obtaining the non-local feature xₗ from the EASA branch and the local feature y_d from the LDE branch, we fuse these two features through element-wise addition. To ensure the output feature has the standard dimension (H × W × C), the fused feature is finally fed into a 1 × 1 convolution to adjust the channel dimension, generating the representative output feature y of the SMFA module. This fusion and output process is:

[eqn]

where $[eqn]$ is the output feature.

In summary, through EASA combined with separable convolution, the SMFA module achieves direction-aware local feature enhancement, realizing weighted enhancement of local direction features and nonlinear capture of inter-channel correlations. It accurately refines the fine-grained features of wafer defects, improves the model’s ability to represent structure-sensitive defects under low-resolution inputs, and helps enhance the precision and robustness of wafer defect detection.
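The two-branch flow described in this subsection can be sketched in NumPy as follows. It is a simplified illustration under several assumptions: the normalization is an L2 norm over channels, the variance is computed per channel over spatial positions, GELU uses its tanh approximation, the pooling kernel is 2, and the channel expansion inside the LDE depth-wise convolution is omitted. All function and weight names are hypothetical.

```python
import numpy as np


def gelu(a):
    # tanh approximation of GELU
    return 0.5 * a * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (a + 0.044715 * a ** 3)))


def conv1x1(x, w):
    """x: (C_in, H, W); w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def dwconv3x3(x, kernels):
    """Depth-wise 3x3 convolution with zero padding; kernels: (C, 3, 3)."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, h, w))
    for i in range(3):
        for j in range(3):
            out += kernels[:, i, j, None, None] * padded[:, i:i + h, j:j + w]
    return out


def max_pool(x, k):
    """Non-overlapping k x k max pooling; H and W must be divisible by k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))


def upsample_nearest(x, k):
    return x.repeat(k, axis=1).repeat(k, axis=2)


def smfa(x, w_in, dw_e, pw_e, dw_l, pw_l, w_out, k=2):
    norm = x / (np.linalg.norm(x, axis=0, keepdims=True) + 1e-6)
    x1, x2 = np.split(conv1x1(norm, w_in), 2, axis=0)   # expand to 2C, then split
    # EASA branch: pool, encode, modulate with variance, upsample, multiply with x.
    x_s = dwconv3x3(max_pool(x1, k), dw_e)
    var = x.var(axis=(1, 2), keepdims=True)             # (C, 1, 1), assumed per channel
    x_l = x * upsample_nearest(gelu(conv1x1(x_s, pw_e) + var), k)
    # LDE branch: depth-wise conv, GELU, 1x1 conv.
    y_d = conv1x1(gelu(dwconv3x3(x2, dw_l)), pw_l)
    return conv1x1(x_l + y_d, w_out)                    # fuse and project to C channels


rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
x = rng.normal(size=(C, H, W))
y = smfa(
    x,
    rng.normal(size=(2 * C, C)),
    rng.normal(size=(C, 3, 3)), rng.normal(size=(C, C)),
    rng.normal(size=(C, 3, 3)), rng.normal(size=(C, C)),
    rng.normal(size=(C, C)),
)
print(y.shape)
```

Running the non-local branch on the pooled map is what keeps its cost moderate: with k = 2 the depth-wise and 1 × 1 convolutions in that branch operate on a quarter of the spatial positions.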

We visualized the feature heatmaps of the SMFA module. As shown in Figure 8, the features without SMFA exhibited relatively scattered responses to defects, with a large number of invalid activations in the background regions. In contrast, after SMFA channel weighting and feature fusion, the activation intensity of defect regions was significantly enhanced, and irrelevant backgrounds were effectively suppressed. This demonstrates that SMFA can improve the discriminability of defect features.

2.3. Wafer Surface Defect Dataset

To advance the practical deployment of YOLO models in industrial inspection, we developed the Wafer Defect Detection (WDD) dataset using an automated defect inspection system. As illustrated in Figure 9, the dataset encompasses wafer samples across diverse process nodes and material specifications, emulating real-world industrial semiconductor production conditions. Equipped with a focusing bracket, the system guides the light source and camera to capture high-clarity images of inspected wafers. Images are transmitted to the inspection terminal, while a rotatable stage enables full-coverage scanning of the wafer surface. Vertically mounted above the target, the high-definition camera minimizes glare from the supplementary light source, preserving imaging fidelity.

Our automated defect inspection system employs a high-definition industrial camera (Model: TD-4KH, manufactured by Shenzhen Sanqiang Teda Optical Instrument Co., Ltd., Shenzhen, China) with a physical resolution of 3840 × 2160. Mounted vertically to suppress glare effectively, this camera ensures clear capture of micro-defect features on wafer surfaces. For high-quality imaging, the system integrates a coaxial light source operating within the 450~650 nm visible light band—a mainstream wavelength range for industrial semiconductor visual inspection. By further mitigating surface reflections on wafers, the coaxial light source enhances the imaging contrast of defect edges significantly. The focusing mechanism consists of a precision bracket with both coarse and fine adjustment knobs, featuring a fine adjustment gradation of 0.002 mm and a maximum stroke of 50 mm, enabling full-coverage, high-precision focusing for wafers of varying specifications. Additionally, the system’s rotatable stage offers an effective loading area of 187 mm × 160 mm, a two-dimensional movement range of 60 mm × 55 mm, and a fine adjustment gradation of 1 mm, supporting stable, uniform wafer movement and rotation to achieve non-redundant full-surface scanning. Together, these components synergize to establish a robust foundation for the industrial relevance and reliability of the WDD dataset.

Field validation identifies six primary surface defect types in wafers: Scratch, Edge-bite, Stains-embedded, Gray-line, Open, and Short-circuit. These categories cover both common structural defects and rare anomalous patterns, spanning a wide range of morphological characteristics and severity levels observed in practical semiconductor manufacturing processes. Given that wafer surface defects in industrial environments are susceptible to ambient light fluctuations, imaging equipment variations, and other confounding factors, we incorporated diverse exposure and illumination conditions during image acquisition to simulate realistic inspection scenarios. This design enhances the model’s robustness against lighting variations and surface texture inhomogeneities, reduces sensitivity to acquisition inconsistencies, and ensures the dataset’s representativeness and practical applicability in real-world industrial settings. Moreover, the inclusion of multi-condition samples facilitates the evaluation of model generalization performance under complex and variable inspection environments. As a result, the constructed dataset provides a reliable benchmark for assessing wafer defect detection methods in industrially relevant conditions.

In constructing the dataset, the annotation process was overseen by experienced semiconductor data labeling experts, adhering to standardized protocols in the semiconductor inspection domain. A dual-verification mechanism, combining cross-validation by two annotators with algorithm-assisted review, was implemented to ensure annotation consistency. The WDD dataset comprises 5605 images (640 × 640 pixels each), randomly partitioned into training, validation, and test sets at an 8:1:1 ratio: 4484 training images, 561 validation images, and 560 test images. Defect category distribution is summarized in Table 1.
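The reported counts (4484/561/560 out of 5605) are consistent with an 8:1:1 random split. The following is a minimal sketch of such a split, not the authors' script: the function name `split_indices`, the fixed seed, and the choice to give the validation set the odd leftover image are assumptions made for illustration.

```python
import random

def split_indices(n_images, seed=0):
    """Shuffle image indices and split them 8:1:1 into train/val/test.

    ASSUMPTION: the seed and the rounding rule (validation gets the odd
    leftover image) are illustrative; the paper only reports the counts.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    n_train = n_images * 8 // 10             # 80% for training (integer arithmetic)
    n_val = (n_images - n_train + 1) // 2    # half of the remainder, rounded up
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train, val, test = split_indices(5605)
print(len(train), len(val), len(test))  # 4484 561 560
```

Integer arithmetic is used for the 80% boundary so that the counts do not depend on floating-point rounding.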

2.4. Experimental Environment and Evaluation Metrics

To verify the effectiveness and generalization ability of the proposed model, comparative and ablation experiments are conducted on the WDD dataset. The software environment is PyTorch 2.3.0, Python 3.12.3, and CUDA 12.1; the hardware comprises one NVIDIA GeForce RTX 2080 Ti graphics processing unit (GPU) with 11 GB of video memory and one Intel(R) Xeon(R) Platinum 8255C CPU running at 2.50 GHz with 12 cores. Mosaic data augmentation [30] is adopted and implemented as follows: four training images randomly selected from the WDD dataset are first resized to 640 × 640 (the input size of the model) and then stitched into a single 640 × 640 image in a 2 × 2 grid layout. During stitching, random scaling, random flipping, and random translation are applied to each of the four images to increase the diversity of the augmented data. In addition, the brightness and contrast of the stitched image are randomly adjusted to simulate different lighting conditions in real scenarios, which improves the robustness of the model. Mosaic augmentation is disabled during the last 10 epochs of training to improve the model’s generalization ability. The training parameters are set as follows: 1500 training epochs, a batch size of 32, and an early stopping patience of 100. The optimizer is Stochastic Gradient Descent (SGD) [31] with a momentum of 0.937 and a weight decay of 0.0005, and the learning rate follows a cosine annealing schedule.
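The training settings above can be collected into a short sketch of the learning-rate schedule and of the rule that switches mosaic augmentation off near the end of training. This is an illustration rather than the authors' training code. The initial and final learning rates (`LR0`, `LR_FINAL`) are not stated in the text and are assumed values, and the function names are hypothetical.

```python
import math

EPOCHS = 1500          # from the text
BATCH_SIZE = 32        # from the text
PATIENCE = 100         # early-stopping patience, from the text
MOMENTUM = 0.937       # SGD momentum, from the text
WEIGHT_DECAY = 0.0005  # from the text
CLOSE_MOSAIC = 10      # mosaic is disabled for the last 10 epochs, from the text
LR0 = 0.01             # ASSUMPTION: the initial learning rate is not stated
LR_FINAL = 0.0001      # ASSUMPTION: the final learning rate is not stated

def cosine_lr(epoch, epochs=EPOCHS, lr0=LR0, lr_final=LR_FINAL):
    """Cosine-annealed learning rate for a 0-based epoch index."""
    progress = epoch / (epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    return lr_final + 0.5 * (lr0 - lr_final) * (1.0 + math.cos(math.pi * progress))

def mosaic_enabled(epoch, epochs=EPOCHS, close_mosaic=CLOSE_MOSAIC):
    """True while mosaic augmentation is still active for this 0-based epoch."""
    return epoch < epochs - close_mosaic

for epoch in (0, 749, 1489, 1490, 1499):
    print(epoch, round(cosine_lr(epoch), 6), mosaic_enabled(epoch))
```

With these assumed endpoints the schedule starts at `LR0`, decays smoothly, and reaches `LR_FINAL` at the last epoch; mosaic is active up to and including epoch index 1489.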

This study uses Precision, Recall, and mean Average Precision (mAP) to evaluate the detection performance of current mainstream algorithms and of the proposed model. The number of parameters (Params), Floating-Point Operations (FLOPs), and Frames Per Second (FPS) are used to assess the model’s suitability for industrial application [32]; they reflect model size, computational cost, and detection speed, respectively. Precision and Recall quantify the model’s prediction ability [33], and mAP is a widely adopted metric derived from them. Params and FLOPs are key indicators of model complexity and computational resource demand, while FPS measures detection speed as the number of images processed per second. The metrics are calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\,dR$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

$$FPS = \frac{1000}{t}$$

where True Positive (TP) denotes the number of samples correctly predicted as defects, False Positive (FP) denotes the number of samples incorrectly predicted as defects, and False Negative (FN) denotes the number of defect samples incorrectly predicted as non-defects. P(R) is the precision as a function of recall, N is the number of defect categories, and AP_i is the AP of the i-th category. mAP is the AP averaged over Intersection over Union (IoU) [34] thresholds from 0.5 to 0.95, and mAP50 is the AP at IoU = 0.5. In addition, t is the per-frame processing time, measured in milliseconds.
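The metrics described above can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard definitions, not the evaluation code used in the paper. The function names are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the all-point interpolation used for AP is one common convention that the text does not specify.

```python
def precision(tp, fp):
    """Fraction of predicted defects that are real defects."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of real defects that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).

    `recalls` must be sorted in increasing order, with matching `precisions`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)

def fps_from_ms(ms_per_frame):
    """Frames per second from a per-frame processing time in milliseconds."""
    return 1000.0 / ms_per_frame

print(precision(8, 2), recall(8, 8), fps_from_ms(40.0))  # 0.8 0.5 25.0
```

A prediction counts as a true positive for mAP50 when its IoU with a ground-truth box of the same category is at least 0.5; for mAP the same AP computation is repeated at stricter IoU thresholds and the results are averaged.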

3. Results

3.1. Comparison with the Results of Advanced Models

The defect detection results of each model on the WDD dataset are shown in Figure 10 and Figure 11. Each detection image contains predicted bounding boxes, defect categories, and confidence scores. The results show that DAS-YOLOv13 accurately detects subtle defects and defects of different scales on the WDD dataset, which highlights its practical value for quality control in real-world wafer manufacturing.

To systematically evaluate the generalization performance of DAS-YOLOv13, current mainstream defect detection algorithms are selected as comparison baselines, including anchor-based algorithms (ATSS [35], Faster R-CNN), anchor-free algorithms (TOOD [36], FCOS), lightweight YOLO series algorithms (YOLOv5, YOLOv8, YOLOv10 [37], YOLOv11, YOLOv13), and specialized defect detection algorithms (LD) [38].

All models are trained and evaluated under unified experimental settings on the WDD dataset. The evaluation metrics cover overall detection precision (mAP, mAP50), per-category detection precision (Edge-bite, Stains-embedded, Gray-line, Short-circuit, Open, Scratch), and model efficiency (number of parameters (Params), computational cost (FLOPs), and inference speed (FPS)). The detailed comparison is given in Table 2, where bold values indicate the best result for each metric. In terms of overall detection precision, DAS-YOLOv13 achieves an mAP of 74.2% and an mAP50 of 92.9% on the WDD dataset, both the highest among all methods. For easily confused wafer defect types such as Scratch, Gray-line, and Short-circuit, DAS-YOLOv13 also achieves the best detection precision (84.2%, 76.5%, and 81.2%, respectively), indicating that the proposed multi-scale enhancement and dual-axis attention modules improve the recognition of multi-scale defects in the dataset.

In terms of detection speed, DAS-YOLOv13 runs at 28.9 FPS. Although it does not match the very high speeds of YOLOv5 and some other lightweight models, it still meets the quasi-real-time requirements of industrial inspection. In terms of model complexity, DAS-YOLOv13 requires only 5.45 M parameters and 20.4 GFLOPs, achieving high-precision detection at a low computational cost.

In comparison with lightweight models such as YOLOv11 and YOLOv13, DAS-YOLOv13 improves mAP by 4.5 and 4.3 percentage points, respectively, with only a slightly larger model scale, and obtains the best detection precision for key wafer defect categories such as Scratch, Gray-line, and Short-circuit. Although YOLOv8 and YOLOv10 have higher inference speeds, their precision is well below that of DAS-YOLOv13, with clear shortcomings in tiny-defect detection. Considering precision, speed, and model scale together, DAS-YOLOv13 offers a more balanced performance and is better suited to wafer inspection deployments that require both high precision and a lightweight model.

Figure 12 compares the models in terms of FPS and mAP, with the circle size representing the parameter count. DAS-YOLOv13 combines a small parameter count with the highest mAP, and thus offers a better trade-off among detection accuracy, inference speed, and model complexity than the other models.

To reduce the influence of training stochasticity on the results, all compared models were trained independently 10 times, and the mean Average Precision (mAP) of each run was evaluated on the test set. As illustrated in Figure 13, the mean values and error bars show each model’s stability across the training runs.

The error bars denote the range between the maximum and minimum mAP values of each model and thus reflect performance fluctuations under varying initialization conditions. DAS-YOLOv13 yields the shortest error bars, indicating lower performance variance and better training stability on the WDD dataset. Furthermore, the error range of DAS-YOLOv13 shows negligible overlap with those of the other models, which suggests that its performance gains are stable and not an artifact of run-to-run variation.
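As a sketch of how such min-max error bars can be computed, the snippet below summarizes the per-run mAP values of each model and checks whether two ranges overlap. The per-run numbers and the names `model_a` and `model_b` are hypothetical placeholders, not results from the paper, and the helper names are illustrative.

```python
from statistics import mean

def summarize_runs(map_values):
    """Mean and min-max range of mAP over independent training runs."""
    lo, hi = min(map_values), max(map_values)
    return {"mean": mean(map_values), "min": lo, "max": hi, "range": hi - lo}

def ranges_overlap(a, b):
    """True if the [min, max] intervals of two summaries intersect."""
    return a["min"] <= b["max"] and b["min"] <= a["max"]

# HYPOTHETICAL per-run mAP values (%) for two models, 10 runs each.
runs = {
    "model_a": [73.9, 74.0, 74.1, 74.2, 74.0, 74.1, 73.9, 74.2, 74.0, 74.1],
    "model_b": [69.1, 70.2, 69.5, 70.4, 69.9, 69.3, 70.0, 69.7, 70.3, 69.6],
}
summaries = {name: summarize_runs(values) for name, values in runs.items()}
for name, s in summaries.items():
    print(name, round(s["mean"], 2), round(s["range"], 2))
print(ranges_overlap(summaries["model_a"], summaries["model_b"]))
```

Non-overlapping min-max ranges are a descriptive indicator of separation between models; a formal claim of statistical significance would additionally require a hypothesis test on the per-run values.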

Although the baseline model occasionally achieved a higher mAP50 in individual training runs, this does not signify superior overall performance; it stems primarily from the baseline’s high sensitivity to random initialization. Since mAP50 only counts detections under the relaxed criterion of Intersection over Union (IoU) ≥ 0.5, it is more susceptible to a small number of spuriously high-confidence predictions, which can produce local performance peaks in certain runs. Such peaks are typically accompanied by substantial performance variance and cannot be reproduced reliably across independent experiments. In contrast, DAS-YOLOv13 consistently delivered higher average performance and markedly lower variance across all 10 runs. This further corroborates that the proposed feature enhancement modules improve the model’s overall stability and generalization capability for multi-scale defect detection.

3.2. Results of Ablation Experiments

Table 3 elaborates on the improvements implemented in YOLOv13, encompassing the Dual-Axis Attention (DAA), Adaptive Dynamic Multi-scale Representation (ADMR), and Self-Modulation Feature Aggregation (SMFA) modules.

The baseline YOLOv13 achieves an mAP of 69.9% and an mAP50 of 92.2%. Embedding the DAA module into the backbone and neck networks raises the mAP by 1.1 percentage points to 71.0% and the mAP50 to 92.9%. Compared to the baseline, the detection accuracies for structure-sensitive defects such as Gray-line and Open increase by 2.6 and 3.1 percentage points, respectively. These findings validate the efficacy of DAA’s directional decomposition-based feature enhancement in mitigating the adverse effects of defect morphological variations on detection performance.

With the integration of the ADMR module into the neck network, the model attains an mAP of 70.4%. Although the number of parameters increases from 2.44 M to 4.02 M and the FLOPs rise from 6.2 G to 11.3 G, the detection accuracies for Edge-bite and Open defects are still improved. This observation indicates that the local fine-grained reconstruction mechanism of ADMR strengthens feature consistency, while the added computational overhead has a controllable impact on inference efficiency and thus preserves the model’s deployment value in wafer production lines.

When the SMFA module is further integrated, the full DAA + ADMR + SMFA scheme raises the mAP to 74.2%, the largest performance gain. It performs particularly well on edge and slit-like defects: Edge-bite increases from 64.1% to 68.8% and Short-circuit from 76.3% to 81.2%, confirming the advantages of SMFA in fine-grained spatial modeling. Although the number of parameters and the FLOPs increase to 5.45 M and 20.4 G, respectively, the broad improvement across defect types reduces the risk of missed detections, so the performance gain outweighs the added computational cost. With all three improvements integrated, DAS-YOLOv13 delivers the best overall detection performance among the configurations evaluated in this study, verifying the effectiveness of each enhancement.
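The mAP values quoted in this ablation discussion can be turned into gains over the baseline with a few lines of Python. This is a small arithmetic sketch: the dictionary keys are illustrative labels, treating the 70.4% figure as a baseline-plus-ADMR configuration is an assumption, and the helper names are hypothetical.

```python
# mAP values (%) quoted in the ablation discussion.
# ASSUMPTION: 70.4% is read as the baseline with ADMR only.
ablation_map = {
    "YOLOv13 (baseline)": 69.9,
    "+ DAA": 71.0,
    "+ ADMR": 70.4,
    "+ DAA + ADMR + SMFA": 74.2,
}

def absolute_gains(results, baseline_key):
    """Gain over the baseline in percentage points."""
    base = results[baseline_key]
    return {name: round(value - base, 1)
            for name, value in results.items() if name != baseline_key}

def relative_gain(value, base):
    """Gain over the baseline as a percentage of the baseline value."""
    return (value - base) / base * 100.0

gains = absolute_gains(ablation_map, "YOLOv13 (baseline)")
print(gains)  # {'+ DAA': 1.1, '+ ADMR': 0.5, '+ DAA + ADMR + SMFA': 4.3}
print(round(relative_gain(74.2, 69.9), 2))
```

The distinction matters when reading the results: the 4.3 figure is an absolute difference in percentage points, which corresponds to a relative improvement of roughly 6% over the baseline mAP.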

4. Performance on Public HRIPCB Dataset

To further evaluate the generalization capability of the DAS-YOLOv13 model across diverse datasets, we conduct extensive experiments on HRIPCB [39]—a public benchmark dataset for printed circuit board (PCB) defect detection—and compare its performance with that of the YOLOv13 baseline model. Quantitative results are summarized in Table 4.

On the HRIPCB dataset, DAS-YOLOv13 attains an overall mAP of 51.5% and an mAP50 of 94.9%, marking respective improvements of 0.9% and 0.8% over the YOLOv13 baseline. In terms of defect-specific performance, the model boosts detection accuracy by 3.7% for mouse bite defects, and by 3.3% and 3.2% for spur and spurious copper defects, respectively. These findings demonstrate that our proposed DAS-YOLOv13 model exhibits good generalization on public PCB defect detection benchmarks, thereby further validating the effectiveness and soundness of its architectural design.

5. Conclusions

This paper proposes DAS-YOLOv13, a wafer surface defect detection model based on the YOLOv13 framework, to address the requirements of online inspection in industrial production lines. By integrating three key modules, namely Dual-Axis Attention (DAA), Adaptive Dynamic Multi-scale Representation (ADMR), and Self-Modulation Feature Aggregation (SMFA), the model enhances perceptual capability for small-scale, low-texture defects while keeping computational overhead low. Experimental results demonstrate that DAS-YOLOv13 raises the mean Average Precision (mAP) from 69.9% to 74.2% with a model of 5.45 M parameters and 20.4 GFLOPs. It achieves notable gains for critical defect categories, including Edge-bite, Gray-line, and Short-circuit, and at 28.9 FPS it meets the quasi-real-time requirements of online inspection systems.

Despite these advances, our model still has several limitations. Its potential for lightweight deployment on edge devices remains largely untapped, as network redundancy and computational complexity impede efficient execution on resource-constrained hardware. In addition, the lack of standardized scale annotation in wafer defect image acquisition and visualization limits the characterization of tiny wafer defects. Future work will focus on two directions. First, we will advance the model’s lightweight design and end-to-end acceleration via network pruning, mixed-precision quantization, and knowledge distillation, enabling low-latency, low-power deployment on edge hardware for industrial automated inspection. Second, we will optimize and unify the scale annotation specifications across data collection, feature visualization, and the presentation of detection results to better characterize tiny wafer defects. Ultimately, this research aims to provide a generalizable, efficient, and lightweight solution that can be integrated into resource-constrained automated inspection systems for practical industrial applications.

Bibliography


1. Chen R., Li Y.-C., Cai J.-M., Cao K., Lee H.-B.-R. Atomic level deposition to extend Moore’s law and beyond. Int. J. Extrem. Manuf. 2020, 2, 022002. doi:10.1088/2631-7990/ab83e0
2. Yu N., Li H., Xu Q. A full-flow inspection method based on machine vision to detect wafer surface defects. Math. Biosci. Eng. 2023, 20, 11821–11846. doi:10.3934/mbe.2023526. PMID: 37501422
3. Huajun S., Guochao S., Jinbo W. Efficient vision defect detection model for drill pipe thread. J. Electron. Imaging 2025, 34, 033010. doi:10.1117/1.jei.34.3.033010
4. Piao M., Jin C.H., Lee J.Y., Byun J.Y. Decision Tree Ensemble-Based Wafer Map Failure Pattern Recognition Based on Radon Transform-Based Features. IEEE Trans. Semicond. Manuf. 2018, 31, 250–257. doi:10.1109/TSM.2018.2806931
5. Cai P., Liu A., Gao L., Dai S., Wu Q., Long Y., Huang L., Zhu T. High-speed wafer surface defect detection with edge enhancement via optical spatial filtering in serial time-encoded imaging. Opt. Laser Technol. 2025, 180, 111442. doi:10.1016/j.optlastec.2024.111442
6. Kim J.G., Yoo W.S., Jang Y.S., Lee W.J., Yeo I.G. Identification of Polytype and Estimation of Carrier Concentration of Silicon Carbide Wafers by Analysis of Apparent Color using Image Processing Software. ECS J. Solid State Sci. Technol. 2022, 11, 064003. doi:10.1149/2162-8777/ac760e
7. He L., Ray N., Guan Y., Zhang H. Fast Large-Scale Spectral Clustering via Explicit Feature Mapping. IEEE Trans. Cybern. 2019, 49, 1058–1071. doi:10.1109/TCYB.2018.2794998. PMID: 29994519
8. Xie W., Sun X., Ma W. A light weight multi-scale feature fusion steel surface defect detection model based on YOLOv8. Meas. Sci. Technol. 2024, 35, 055017. doi:10.1088/1361-6501/ad296d