TaDP-Det: Semi-Supervised Texture-Aware Dynamic Pseudo-Labeling Detector for Industrial Surface Defect Detection

Qiwu Luo; Weiyu Zhan; Jiaojiao Su

PMC · DOI: 10.3390/s26041085 · February 7, 2026


TL;DR

This paper introduces TaDP-Det, a new semi-supervised method for detecting surface defects in industrial settings using texture-aware pseudo-labeling.

Contribution

The novel approach combines texture enhancement and dynamic label filtering to improve pseudo-label quality in semi-supervised defect detection.

Findings

01. TaDP-Det outperforms existing semi-supervised object detection methods on multiple industrial defect datasets.

02. The proposed Texture Enhance Module improves pseudo-label reliability in ambiguous regions.

03. Class-wise dynamic filtering enhances detection accuracy by adapting to defect-specific challenges.

Abstract

Surface defect detection is essential for industrial quality control, but obtaining reliable labeled data remains costly due to the need for expert annotation. Semi-supervised object detection (SSOD) mitigates this need by leveraging unlabeled data through pseudo-labeling. However, industrial surface imagery presents specific challenges, including texture-ambiguous, low-contrast backgrounds that cause foreground–background confusion and strong class-dependent detection difficulty, which renders global confidence thresholds ineffective, often yielding noisy and imbalanced pseudo labels. To overcome these limitations, we propose TaDP-Det, a semi-supervised detector that improves pseudo-label quality through dual enhancements in feature representation and label filtering. We first introduce a Texture Enhance Module (TEM), designed as a texture-aware patch-level mixture-of-experts applied…


Funding

National Natural Science Foundation of China

Keywords

surface defect detection; semi-supervised object detection; pseudo labels; texture enhancement


Taxonomy

Topics: Advanced Neural Network Applications · Industrial Vision Systems and Defect Detection · Medical Image Segmentation Techniques

Full text

1. Introduction

Automated visual inspection (AVI) has become a key component of modern industrial manufacturing, playing a central role in ensuring production efficiency and product reliability [1]. Among various inspection tasks, surface defect detection is particularly critical, as undetected defects may lead to safety risks and substantial economic losses. While deep learning-based object detection has achieved remarkable success in natural image domains, its direct application to industrial defect inspection is severely constrained by the prohibitive cost of obtaining large-scale, reliably annotated datasets. Unlike annotating common objects in natural scenes, which is a task supported by standardized protocols, labeling industrial defects demands substantial domain expertise. Annotators must identify subtle, often low-contrast anomalies, adhere to application-specific taxonomies, and maintain consistent localization amidst ambiguous boundaries [2]. Consequently, the annotation process becomes expensive, time-consuming, and susceptible to human inconsistency, creating a fundamental knowledge and cost gap that limits the scalability of fully supervised approaches in industrial settings.

Semi-supervised object detection (SSOD) presents a promising solution by leveraging abundant unlabeled data through pseudo-labeling within a teacher–student framework, thereby reducing reliance on manual annotations. However, the unique characteristics of industrial surface imagery pose distinct challenges that hinder the direct application of existing SSOD methods, which were primarily developed and validated on natural-image benchmarks (e.g., Soft Teacher [3] and Unbiased Teacher [4]). Research on SSOD for industrial inspection is still evolving. Early efforts to reduce annotation costs often focused on weak supervision [5]. More recent works have begun to explore the teacher–student paradigm for this domain [6], progressing along several practical avenues: (1) enhancing pseudo-label generation through stronger training strategies [7], (2) employing data-centric selection to prioritize informative unlabeled samples [8], and (3) incorporating auxiliary priors or adaptive mechanisms to mitigate label noise and class imbalance [9,10]. While these advancements demonstrate the feasibility of SSOD for industrial tasks, they often rely on generic, class-agnostic filtering heuristics. As a result, they frequently fail to explicitly address two core, intertwined challenges inherent to surface defect detection: pervasive texture ambiguity and pronounced class-dependent detection difficulty.

Texture ambiguity exacerbates pseudo-label noise. Industrial surfaces often exhibit cluttered, repetitive backgrounds with weak contrast, where defects blend into the substrate. This, coupled with ambiguous foreground–background boundaries, leads to significant confusion, especially in the shallow, texture-sensitive layers of a neural network. Consequently, the teacher model may miss genuine defects, assign them low confidence, or generate class-indistinct false positives, propagating errors throughout the self-training cycle.

Detection difficulty is highly class-dependent. Defect categories within the same dataset can vary dramatically in visual salience. For instance, subtle defects like Crazing (in the NEU-DET dataset [11]) systematically receive lower confidence scores from a detector than more conspicuous defects like Inclusion or Patch. Applying a single, global confidence threshold across all classes, which is a common practice in general SSOD, inevitably discards true positives from “hard” classes while retaining easier, potentially noisier pseudo-labels from “easy” classes. This results in imbalanced and biased supervision, ultimately degrading model performance on underrepresented or challenging defect types.

To address these dual challenges, we propose TaDP-Det, a semi-supervised detector that enhances pseudo-label quality from complementary feature-level and filtering-level perspectives. Our contributions are as follows:

(1) We propose a Texture Enhance Module (TEM), integrated as a texture-aware, patch-level mixture-of-experts at the shallow backbone stage. TEM strengthens discriminative low-level texture cues, thereby improving feature discrimination in ambiguous regions and yielding more reliable pseudo-labels from the teacher model.
(2) We propose Class-wise Dynamic Pseudo-label Filtering (CDPF), a mechanism that employs lightweight, per-class Gaussian Mixture Models (GMMs) to dynamically estimate category-specific confidence thresholds from the evolving score distributions during training. CDPF preserves challenging true positives while suppressing noise and alleviating supervision imbalance.

We conduct extensive experiments on three public industrial defect datasets: NEU-DET [11], GC10-DET [12], and PCB-DEFECT [13]. Results demonstrate that TaDP-Det consistently outperforms state-of-the-art SSOD baselines in mean average precision (mAP) under various labeled-data ratios (1%, 5%, 10%). Notably, the class-wise dynamic pseudo-label filtering is applied only during the training phase; the final inference is performed with a single, streamlined detector equipped with the proposed TEM, ensuring practical efficiency for industrial deployment.

The rest of this paper is organized as follows: Section 2 reviews related work on semi-supervised object detection, feature refinement, and label assignment. Section 3 presents the proposed method in detail. Section 4 reports experimental results and ablation studies. Section 5 concludes the paper.

2. Related Works

2.1. Semi-Supervised Object Detection

Semi-supervised object detection (SSOD) aims to leverage abundant unlabeled data to reduce annotation costs, primarily through the teacher–student self-training paradigm. Within this framework, a teacher model generates pseudo-labels to supervise a student detector, with iterative refinements improving overall performance. Early works like Soft Teacher [3] enhanced pseudo-label reliability through confidence-weighted supervision and box jittering. Subsequent research has focused on addressing specific challenges: Unbiased Teacher v2 [14] introduced the Listen2Student mechanism to improve pseudo-supervision for dense detectors, while DSL [15] stabilized training for one-stage architectures via adaptive filtering and consistency regularization. Recent advances have further refined pseudo-label quality and assignment. PseCo [16] mitigates label noise through prediction-guided assignment and consistency voting. Methods like Mix Teacher [17] and Consistent-Teacher [18] improve scale robustness and reduce assignment inconsistencies, respectively. ARSL [19] jointly addresses selection and assignment ambiguities. More recently, the teacher–student paradigm has been extended to DETR-style set-based detectors. Semi-DETR [20] stabilizes SSOD for DETR by introducing stage-wise hybrid matching and enforcing cross-view query consistency under noisy pseudo labels. Sparse Semi-DETR [21] further improves DETR-style SSOD by refining object queries and applying reliable pseudo-label filtering to reduce supervision noise. STEP-DETR [22] introduces a Super Teacher and pseudo-label-guided text queries to alleviate confidence bias and enhance supervision reliability through denoising text-guided queries and query refinement. 
Despite these advances, applying generic SSOD methods to industrial surface defect detection remains challenging due to unique characteristics like texture ambiguity and pronounced class imbalance in detection difficulty, which are not explicitly addressed by current frameworks.

2.2. Feature Refinement

Feature refinement techniques are widely employed to recalibrate intermediate representations, suppressing irrelevant background information and enhancing discriminative cues. A dominant approach uses attention mechanisms to weight informative channels or spatial regions, as exemplified by SENet [23], CBAM [24], and their more efficient variant ECA-Net [25]. Beyond local attention, modeling long-range dependencies through global context, as in non-local networks [26] and GCNet [27], can also refine feature representations. For improved geometric adaptation, deformable convolutions DCNv2 [28] and spatially adaptive operators like SKNet [29] dynamically adjust receptive fields. Another line of work [30] leverages frequency or texture cues to amplify subtle patterns, which is particularly relevant for defect detection. FcaNet [31] incorporates frequency domain analysis into channel attention, and frequency-domain enhancement has proven effective for highlighting hard-to-distinguish defects [32,33]. Conditional computation methods, such as CondConv [34] and Dynamic Convolution [35], adapt network parameters to the input, enabling flexible feature enhancement. However, most general-purpose refinement modules are not explicitly designed for the shallow, texture-critical stages of a network where foreground–background confusion originates. This gap is especially consequential in semi-supervised settings, where noisy features can propagate into erroneous pseudo-labels, undermining the training stability for industrial defect detection.

2.3. Label Assignment

Label assignment is a core component in object detection, determining which anchor boxes or feature points are designated as positive samples for supervision. Traditional methods rely on hand-crafted rules, such as Intersection-over-Union (IoU) thresholds in anchor-based detectors [36,37] or geometric center constraints in anchor-free models like FCOS [38]. To reduce hand-crafted heuristic design, adaptive strategies have been developed, including ATSS [39] and learning-based methods like PAA, OTA [40,41], FreeAnchor, and AutoAssign [42,43]. The rise of set-based detectors, notably DETR [44], introduced one-to-one matching via the Hungarian algorithm, offering a distinct assignment paradigm. While label assignment focuses on training-time supervision, matching-based formulations have also been explored for prediction selection. For example, Differentiable NMS via Sinkhorn Matching [45] casts NMS as a differentiable matching problem, offering a related perspective on strengthening selection to improve detection robustness.

In semi-supervised object detection, the assignment problem extends to selecting and weighting pseudo-targets from unlabeled data, where predictions are inherently noisy. Methods have evolved to address this challenge. Soft Teacher [3] employs confidence-weighted supervision, while PseCo [16] refines assignment through consistency voting among proposals. Consistent-Teacher [18] enhances assignment robustness for anchor-based detectors, and ARSL [19] tackles ambiguity via joint confidence estimation. A common limitation of these approaches is their reliance on a global, class-agnostic confidence threshold for pseudo-label selection. This is suboptimal for industrial defect datasets where detection difficulty varies drastically across classes, causing easy classes to dominate the pseudo-label set and harder classes to be suppressed, ultimately leading to biased learning and class imbalance.

3. Proposed Method

3.1. Architecture Overview

Our proposed method, TaDP-Det, is built upon the teacher–student self-training paradigm for semi-supervised object detection. The framework utilizes both labeled and unlabeled data: for labeled images, the student model is trained with conventional supervised detection losses, while for unlabeled images, we adopt a consistency-based approach where the teacher model processes a weakly augmented view to generate pseudo-labels, which subsequently supervise the student model on a corresponding strongly augmented view. The teacher’s parameters are updated via an exponential moving average (EMA) of the student’s weights.
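The EMA teacher update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter dictionaries and the `ema_update` helper are hypothetical, and the default momentum of 0.999 is the value reported in the experimental setup.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """Move each teacher parameter toward the student's by an EMA step.

    `teacher` and `student` are hypothetical dicts that map parameter
    names to NumPy arrays of identical shapes.
    """
    for name, s_param in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * s_param
    return teacher

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
ema_update(teacher, student)  # teacher["w"] moves 0.1% of the way to the student
```

With a momentum this close to 1, the teacher changes slowly, which is what makes its pseudo-labels more stable than the student's raw predictions.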

As illustrated in Figure 1, we propose two complementary mechanisms to enhance pseudo-label quality. First, we introduce feature-side enhancement via a Texture Enhance Module (TEM), which is integrated at the shallow C3 stage of both the teacher and student detectors. This module strengthens discriminative texture cues in low-level features, thereby mitigating foreground–background confusion and improving the reliability of pseudo-labels in texture-ambiguous regions. Second, we propose a score-side refinement through Class-wise Dynamic Pseudo-label Filtering (CDPF). This component analyzes the confidence distribution of teacher predictions per category to determine adaptive thresholds, selectively retaining high-quality pseudo-labels while filtering noise. During inference, the framework reduces to a single, efficient detector equipped with TEM, as the CDPF mechanism is only applied during training.

3.2. Texture Enhance Module: A Texture-Aware Patch-Level Mixture-of-Experts

Shallow feature maps are rich in textural information but are also prone to foreground–background confusion in industrial imagery, where defects often exhibit subtle, low-contrast patterns. This ambiguity propagates through the network, degrading the reliability of pseudo-labels generated by the teacher model. To address this, we propose a lightweight Texture Enhance Module (TEM) that refines shallow features through an adaptive, sparse mixture-of-experts (MoE) mechanism, guided by a texture-aware descriptor. As illustrated in Figure 2a, given an input feature map $[eqn]$ from the C3 stage, TEM operates in three stages: (1) constructing a texture descriptor, (2) tokenizing features and routing patches to experts, and (3) applying expert transformations and fusing the results. Industrial backgrounds are typically repetitive and homogeneous, dominated by smooth low-to-mid frequency variations. Defects instead appear as sparse local discontinuities with elevated high-frequency responses. This contrast can be unified as local-to-global inconsistency: the background aligns with a global texture prototype, while foreground defects deviate from it. TEM encodes this inconsistency into a texture-aware routing signal. Guided by it, experts are selectively applied to prototype-inconsistent regions, improving defect discrimination in feature representation and the reliability of teacher-generated pseudo labels.

3.2.1. Texture-Aware Descriptor Construction

We construct a texture-aware descriptor D to measure local deviations from a global texture prototype so that regions with potential anomalies can be highlighted. The overall pipeline is summarized in Figure 2b and detailed in Algorithm 1.

Algorithm 1 Texture-aware Descriptor Construction
Require: C3 feature map $[eqn]$ ; number of quantization levels K; pooling operator $[eqn]$ .
Ensure: texture-aware descriptor $[eqn]$ .

1: $[eqn]$ ; $[eqn]$ (Equation (1))
2: for all positions p on the $[eqn]$ grid do
3: $[eqn]$ (Equation (2))
4: end for
5: $[eqn]$ ; $[eqn]$ (Equation (3))
6: $[eqn]$ (Equation (3))
7: for $[eqn]$ to K do
8: $[eqn]$ (Equation (3))
9: end for
10: for $[eqn]$ to K do
11: for all positions p do
12: $[eqn]$ (Equation (4))
13: end for
14: end for
15: $[eqn]$ (Equation (5))
16: return D.

The input C3 feature X is first projected into a lower-dimensional local feature A via a 1 × 1 convolution. A global prototype vector g is then obtained by applying global average pooling (GAP) to A:

$$A = \mathrm{Conv}_{1\times 1}(X), \qquad g = \mathrm{GAP}(A) \tag{1}$$

For each position p, the cosine similarity between the local feature $[eqn]$ and the global prototype g is computed, yielding a similarity map $[eqn]$ :

$$S(p) = \frac{\langle A_p,\, g \rangle}{\lVert A_p \rVert_2 \, \lVert g \rVert_2} \tag{2}$$

This similarity map is then softly quantized into K levels to capture the distribution of local texture consistency. Defining $[eqn]$ , $[eqn]$ , the quantization centers and interval are:

[eqn]

The K-level similarity map $[eqn]$ is obtained as

[eqn]

Finally, local average pooling $[eqn]$ is applied over a neighborhood $[eqn]$ to aggregate spatial context, followed by a 1 × 1 convolution to produce the final texture-aware descriptor D:

[eqn]

Intuitively, the cosine similarity map S measures local-to-global texture consistency, and the soft quantization discretizes this continuous signal into multiple consistency levels. After neighborhood pooling, these level responses form a spatially aware distributional summary, analogous to histogram-style statistical texture descriptors. As a result, regions that deviate from the dominant background texture produce distinctive patterns in the texture-aware descriptor D, which provides an effective cue for patch-level routing in the subsequent MoE.
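The descriptor pipeline (projection, global prototype, cosine similarity, soft quantization, neighborhood pooling, output projection) can be sketched in NumPy. Several details here are assumptions rather than the paper's definitions, because the exact equations are not reproduced in this text: the evenly spaced level centers, the triangular soft-assignment kernel, the 3 × 3 pooling neighborhood, and the sizes used in the example call are all illustrative choices, and `texture_descriptor` is a hypothetical name.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution on a (C, H, W) map with a (C_out, C) weight matrix."""
    return np.einsum("oc,chw->ohw", weight, x)

def local_avg_pool(x, k=3):
    """Stride-1 k x k average pooling with edge padding on a (C, H, W) map."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    _, h, w = x.shape
    out = np.zeros_like(x, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += xp[:, dy:dy + h, dx:dx + w]
    return out / (k * k)

def texture_descriptor(x, w_proj, w_out, num_levels=8, eps=1e-8):
    a = conv1x1(x, w_proj)                      # local features A
    g = a.mean(axis=(1, 2))                     # global prototype g (GAP)
    num = np.einsum("chw,c->hw", a, g)
    den = np.linalg.norm(a, axis=0) * np.linalg.norm(g) + eps
    s = num / den                               # cosine similarity map S
    s_min, s_max = s.min(), s.max()
    delta = (s_max - s_min) / num_levels + eps  # level spacing (assumed form)
    centers = s_min + (np.arange(num_levels) + 0.5) * delta
    # Assumed triangular soft assignment of each position to nearby levels.
    q = np.maximum(0.0, 1.0 - np.abs(s[None] - centers[:, None, None]) / delta)
    return conv1x1(local_avg_pool(q), w_out)    # texture-aware descriptor D

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 12, 12))               # stand-in for a C3 feature map
d = texture_descriptor(x, rng.normal(size=(8, 16)), rng.normal(size=(4, 8)))
```

Each position contributes to at most two adjacent levels under the triangular kernel, so after pooling every channel of the quantized map acts like a local histogram bin of texture consistency.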

3.2.2. Patch-Level Tokenization and Expert-Choice Routing

We concatenate the C3 feature X and the texture-aware descriptor D to form routing feature map Z:

$$Z = \mathrm{Concat}(X, D) \tag{6}$$

Z is partitioned into $[eqn]$ non-overlapping patches of size $[eqn]$ , denoted $[eqn]$ . Each patch $[eqn]$ is tokenized via global average pooling (GAP) followed by a multi-layer perceptron (MLP):

[eqn]

A linear router computes an affinity score $[eqn]$ between each expert and each token $[eqn]$ :

[eqn]

We adopt an expert-choice routing strategy. Each expert selects the top-k patches with the highest affinity scores, forming its assignment set $[eqn]$ :

[eqn]

To keep the routed budget stable across resolutions, we parameterize k by a capacity factor $[eqn]$ :

[eqn]

We define a binary selection mask $[eqn]$ :

[eqn]

Since a patch can be assigned to multiple experts, we normalize the expert weight matrix across the selected experts for each patch:

[eqn]

If a patch is not selected by any expert ( $[eqn]$ ), then $[eqn]$ and the patch bypasses the expert computations entirely.
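Expert-choice routing, as described in this subsection, can be sketched as follows. The sketch makes assumptions that the text does not confirm: affinities are normalized with a softmax over experts, and the per-expert budget is taken as k = ceil(capacity × N / E). The function name and argument names are hypothetical.

```python
import math
import numpy as np

def expert_choice_routing(tokens, router_w, capacity=1.0):
    """tokens: (N, d) patch tokens; router_w: (E, d) linear router weights."""
    n, e = tokens.shape[0], router_w.shape[0]
    logits = router_w @ tokens.T                      # (E, N) affinities
    logits -= logits.max(axis=0, keepdims=True)       # numerical stability
    scores = np.exp(logits)
    scores /= scores.sum(axis=0, keepdims=True)       # softmax over experts (assumed)
    k = min(n, max(1, math.ceil(capacity * n / e)))   # assumed budget rule
    top_idx = np.argsort(-scores, axis=1)[:, :k]      # each expert picks k patches
    mask = np.zeros((e, n))
    np.put_along_axis(mask, top_idx, 1.0, axis=1)     # binary selection mask
    gated = mask * scores
    total = gated.sum(axis=0, keepdims=True)
    # Normalize over the experts that selected each patch; unselected stay 0.
    weights = np.divide(gated, total, out=np.zeros_like(gated), where=total > 0)
    return mask, weights

rng = np.random.default_rng(0)
mask, weights = expert_choice_routing(
    rng.normal(size=(16, 8)), rng.normal(size=(4, 8)), capacity=1.0
)
```

Because experts choose patches rather than the reverse, every expert processes exactly k patches, while an individual patch may be picked by several experts or by none.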

3.2.3. Patch-Wise Experts and Feature Fusion

Each expert is implemented as a lightweight depthwise-separable convolutional block $[eqn]$ . For a patch $[eqn]$ extracted from the original feature X, the expert’s output is

[eqn]

[eqn]

where GN denotes Group Normalization and $[eqn]$ is a non-linear activation function (SiLU).

The final refined feature for patch i is a gated sum of expert outputs:

[eqn]

Let $[eqn]$ denote the inverse operation of the non-overlapping patch extraction from X, which reconstructs a feature map by restoring each patch to its original spatial position. The texture refinement feature is then formed as

[eqn]

where patches not selected by any expert are set to $[eqn]$ , leaving the corresponding spatial regions unmodified. Finally, the TEM output feature is obtained via a residual addition:

[eqn]

where the $[eqn]$ convolution acts as a lightweight linear projection to recalibrate and mix channels of the fused expert refinements before residual injection.
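The fusion step (gated sum of expert outputs, restoring patches to their positions, 1 × 1 projection, residual addition) can be sketched as follows. The experts here are arbitrary stand-in callables rather than the depthwise-separable blocks of the paper, and `unfold`, `fold`, and `tem_fuse` are hypothetical helper names.

```python
import numpy as np

def unfold(x, p):
    """Split a (C, H, W) map into non-overlapping p x p patches: (N, C, p, p)."""
    c, h, w = x.shape
    x = x.reshape(c, h // p, p, w // p, p)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, c, p, p)

def fold(patches, h, w):
    """Inverse of `unfold`: put each patch back at its original position."""
    _, c, p, _ = patches.shape
    x = patches.reshape(h // p, w // p, c, p, p)
    return x.transpose(2, 0, 3, 1, 4).reshape(c, h, w)

def tem_fuse(x, weights, experts, w_proj, p):
    """weights: (E, N) normalized gates, zero where a patch is not routed."""
    _, h, w = x.shape
    patches = unfold(x, p)
    refined = np.zeros_like(patches)           # unselected patches stay zero
    for e, expert in enumerate(experts):
        for i in np.flatnonzero(weights[e]):   # only routed patches are computed
            refined[i] += weights[e, i] * expert(patches[i])
    r = fold(refined, h, w)
    return x + np.einsum("oc,chw->ohw", w_proj, r)  # 1x1 projection + residual

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
experts = [lambda t: 2.0 * t, lambda t: -t]    # stand-ins for real expert blocks
gates = np.array([[1.0, 0.0, 0.5, 0.0],
                  [0.0, 0.0, 0.5, 0.0]])
y = tem_fuse(x, gates, experts, np.eye(4), p=4)
```

In this toy call, patches 1 and 3 are routed to no expert, so the corresponding regions of `y` equal those of `x`, which mirrors the bypass behavior described above.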

3.3. Class-Wise Dynamic Pseudo-Label Filtering

TEM improves pseudo-label quality from the feature side. We further refine pseudo labels from the score side by addressing a practical issue in industrial defect detection: teacher confidence scores are class-dependent and training-stage dependent. Hard defect categories tend to receive systematically lower scores, while easy categories are often over-confident. Therefore, a static global threshold $[eqn]$ cannot simultaneously suppress false positives for easy classes and preserve true positives for hard ones. To mitigate this bias, CDPF learns a separate, training-adaptive threshold $[eqn]$ for each class using a lightweight probabilistic model.

3.3.1. Online Per-Class Score Buffer

For each class c, we maintain a First-In-First-Out (FIFO) buffer $[eqn]$ that stores the most recent $[eqn]$ confidence scores from teacher predictions assigned to class c on unlabeled data. At a given training iteration t, we collect the set of confidence scores $[eqn]$ from all teacher predictions on the current unlabeled batch that are classified as class c, where $[eqn]$ denotes the number of such predictions. The buffer is then updated by appending this new set $[eqn]$ , and if the total number of entries exceeds $[eqn]$ , discarding the oldest scores to maintain the fixed capacity:

[eqn]

Here, the operator $[eqn]$ retains only the $[eqn]$ most recent elements. This mechanism provides a compact, continuously refreshed approximation of the marginal score distribution $[eqn]$ throughout the training process.
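A fixed-capacity FIFO buffer of this kind maps directly onto Python's `collections.deque` with `maxlen`. The sketch below is illustrative only; the class name `ClassScoreBuffers` and the tiny capacity are hypothetical.

```python
from collections import deque

class ClassScoreBuffers:
    """Per-class FIFO buffers of recent teacher confidence scores."""

    def __init__(self, num_classes, capacity):
        self.buffers = [deque(maxlen=capacity) for _ in range(num_classes)]

    def update(self, labels, scores):
        """Append one batch of (class label, confidence score) predictions."""
        for c, s in zip(labels, scores):
            self.buffers[c].append(float(s))  # deque drops the oldest on overflow

buf = ClassScoreBuffers(num_classes=2, capacity=3)
buf.update([0, 0, 1, 0, 0], [0.9, 0.2, 0.7, 0.4, 0.6])
print(list(buf.buffers[0]))  # [0.2, 0.4, 0.6]
```

Class 0 receives four scores but keeps only the three most recent, so the earliest score (0.9) is discarded.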

3.3.2. Per-Class Score Modeling with a 1D GMM

The confidence scores produced by the teacher for a given class are not uniformly reliable during self-training. Instead, they typically comprise a mixture of low-score noisy predictions and a high-score subset that corresponds to more reliable pseudo labels. This mixture structure becomes more pronounced as training proceeds, and the teacher becomes better calibrated, which makes a fixed global threshold or a heuristic top-K rule suboptimal for class-wise filtering. In particular, using a fixed threshold tends to cause either overly sparse pseudo labels for hard classes or excessive false positives for easy classes, and the optimal threshold also drifts across training stages.

Motivated by this observation, we model the buffered per-class score distribution using a lightweight two-component 1D Gaussian Mixture Model (GMM) and derive an adaptive threshold from the fitted posterior. We adopt two components as the minimal choice to separate an unreliable mass from a more reliable mass in a 1D score space, rather than to perfectly fit the entire distribution. Using more components may increase sensitivity and become unstable under scarce samples; therefore, we keep the model minimal and pair it with explicit stability safeguards (Algorithm 2) to avoid unstable fitting for rare classes or early iterations.

Algorithm 2 CDPF with Posterior Thresholding (Per Class c)
Require: teacher scores $[eqn]$ for class c; buffer $[eqn]$ ; posterior level $[eqn]$ ; threshold floor $[eqn]$ ; minimum fit size $[eqn]$ .
Ensure: updated $[eqn]$ and threshold $[eqn]$ .

1: Update buffer: $[eqn]$ (Equation (18))
2: if $[eqn]$ then
3: return $[eqn]$ .
4: end if
5: Fit a two-component 1D GMM on $[eqn]$ via EM, obtaining $[eqn]$ and $[eqn]$ .
6: Set $[eqn]$ by assigning the higher-mean component as positive.
7: Compute $[eqn]$ for each $[eqn]$ using Equation (20).
8: if $[eqn]$ then
9: return $[eqn]$ .
10: end if
11: $[eqn]$ (Equation (22))
12: $[eqn]$ (Equation (23))
13: return $[eqn]$ and updated $[eqn]$ .

We model the empirical score distribution for each class, as represented by the buffered scores $[eqn]$ , using a two-component 1D Gaussian Mixture Model (GMM). The density is given by:

[eqn]

where the superscripts $[eqn]$ and $[eqn]$ denote the negative (low-score) and positive (high-score) mixture components, respectively, with corresponding mixing weights satisfying $[eqn]$ . The parameters of Equation (19) are estimated from the buffer $[eqn]$ using the Expectation-Maximization (EM) algorithm. After fitting, we identify the positive component as the one with the higher mean value, i.e., $[eqn]$ .
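Fitting a two-component 1D GMM by EM takes only a few lines of NumPy. The sketch below is one possible implementation, not the authors' code: the initialization from the lower and upper halves of the sorted scores, the fixed iteration count, and the variance floor are assumptions, and `fit_gmm_1d` is a hypothetical name.

```python
import numpy as np

def fit_gmm_1d(scores, n_iter=100, var_floor=1e-6):
    """Fit a two-component 1D GMM by EM.

    Returns (weights, means, variances), sorted so that index 0 is the
    negative (low-mean) component and index 1 the positive (high-mean) one.
    """
    s = np.asarray(scores, dtype=float)
    srt = np.sort(s)
    half = len(srt) // 2
    mu = np.array([srt[:half].mean(), srt[half:].mean()])  # assumed init
    var = np.full(2, s.var() + var_floor)
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each score.
        dens = w * np.exp(-0.5 * (s[:, None] - mu) ** 2 / var)
        dens /= np.sqrt(2.0 * np.pi * var)
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        w = nk / len(s)
        mu = (resp * s[:, None]).sum(axis=0) / nk
        var = (resp * (s[:, None] - mu) ** 2).sum(axis=0) / nk + var_floor
    order = np.argsort(mu)
    return w[order], mu[order], var[order]

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.2, 0.05, 300), rng.normal(0.8, 0.05, 300)])
weights, means, variances = fit_gmm_1d(data)
```

Sorting by mean at the end implements the rule stated above: the component with the higher mean is treated as the positive one.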

3.3.3. Posterior-Based Threshold Adaptation

Given the fitted GMM parameters, the probability that a score s originates from the positive component is given by the posterior:

[eqn]

Our proposed CDPF selects a pseudo-label for class c by requiring this posterior to exceed a predefined confidence level $[eqn]$ :

[eqn]

The posterior level $[eqn]$ specifies the decision boundary on the fitted mixture and thus controls the precision–recall trade-off in pseudo-label selection. Under equal misclassification costs, the Bayes-optimal boundary corresponds to equal posterior probability of the two components, i.e., $[eqn]$ . This naturally yields the default choice $[eqn]$ . For efficient online filtering, we convert the posterior rule in Equation (21) into a class-wise score threshold $[eqn]$ by evaluating $[eqn]$ over the buffered scores and selecting the minimum score that satisfies the constraint:

[eqn]

As a result, although $[eqn]$ is fixed, the derived threshold $[eqn]$ is adaptive: it varies across classes and evolves across training stages as the buffered per-class score distribution changes. To avoid unreliable thresholds in early training or for rare classes with insufficient or poorly separated buffered scores, we apply two stability safeguards. We first enforce a conservative floor:

[eqn]

If the buffer is under-populated ( $[eqn]$ ) or the fitted posterior never reaches the required confidence ( $[eqn]$ ), we fall back to $[eqn]$ to avoid unreliable pseudo-label selection.

During training, teacher predictions for class c with scores $[eqn]$ are retained as pseudo-labels. The overall computational overhead is negligible, since the EM fitting is performed on a one-dimensional two-component mixture over a small buffer.
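The posterior-to-threshold conversion with its two safeguards can be sketched as below. The numeric defaults (`floor`, `fallback`, `min_fit`) are placeholders chosen for illustration, not the paper's settings, and the fallback value in particular is an assumption because the text here does not show which threshold is used. `class_threshold` is a hypothetical name.

```python
import numpy as np

def normal_pdf(s, mu, var):
    return np.exp(-0.5 * (s - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def class_threshold(buffer, gmm, tau=0.5, floor=0.3, fallback=0.7, min_fit=50):
    """Derive a per-class score threshold from a fitted two-component GMM.

    gmm = (weights, means, variances); index 0 is the negative component
    and index 1 the positive one.
    """
    s = np.asarray(buffer, dtype=float)
    if len(s) < min_fit:                      # under-populated buffer
        return fallback
    w, mu, var = gmm
    neg = w[0] * normal_pdf(s, mu[0], var[0])
    pos = w[1] * normal_pdf(s, mu[1], var[1])
    post = pos / (pos + neg + 1e-300)         # posterior of the positive component
    ok = post >= tau
    if not ok.any():                          # posterior never reaches tau
        return fallback
    return max(float(s[ok].min()), floor)     # minimum qualifying score, floored

gmm = (np.array([0.5, 0.5]), np.array([0.2, 0.8]), np.array([0.0025, 0.0025]))
thr = class_threshold(np.linspace(0.0, 1.0, 101), gmm)
```

With equal weights and variances, the posterior crosses 0.5 midway between the two means, so the threshold in this toy case lands near 0.5.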

4. Experimental Results and Analysis

In this section, we first describe the experimental setup, including datasets, implementation details, and evaluation metrics, and then present and analyze the results to validate the effectiveness of the proposed TaDP-Det method.

4.1. Dataset

(1) NEU-DET [11]: To evaluate the proposed semi-supervised method, we first conduct experiments on NEU-DET, a benchmark dataset for surface defect detection on hot-rolled steel released by Northeastern University. It contains 1800 grayscale images of size $[eqn]$ , evenly distributed over six defect categories: crazing (Cr), inclusion (In), patches (Pa), pitted surface (PS), rolled-in scale (RS), and scratches (Sc), with 300 samples per class. We split the dataset into training and validation sets with a 7:3 ratio and then follow a standard semi-supervised protocol by sampling 10%, 5%, and 1% of the training set as labeled data, treating the remaining training images as unlabeled.

(2) GC10-DET [12]: To further assess the generalization ability of our method, we also perform experiments on GC10-DET, a benchmark for metallic surface defect detection. GC10-DET contains 2300 grayscale images with a resolution of $[eqn]$ , covering 10 defect categories: punching (pu), welding line (wl), crescent gap (cg), water spot (ws), oil spot (os), silk spot (ss), inclusion (in), rolled pit (rp), crease (cr), and waist folding (wf). Compared with NEU-DET, GC10-DET exhibits greater diversity in defect appearance, with larger variations in scale, shape, and background texture, making it a more challenging benchmark for robustness evaluation. We adopt a semi-supervised setting with 10% and 5% of the training images used as labeled data, and the remaining training images treated as unlabeled.

(3) PCB-DEFECT [13]: We additionally evaluate our method on PCB-DEFECT, a dataset for defect detection on printed circuit boards. It contains 693 RGB images with a resolution of $[eqn]$ and 2953 annotated defect instances from six categories: missing hole (MH), mouse bite (MB), open circuit (OC), short (SH), spur (SP), and spurious copper (Sc). Each image may contain multiple instances of the same defect type, reflecting the fine-grained and small-scale anomalies commonly encountered in PCB inspection. We adopt a semi-supervised setting with 10% and 5% of the training set as labeled data, treating the remaining training images as unlabeled.
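The split protocol for NEU-DET (7:3 train/validation, then a labeled fraction of the training set) can be sketched as follows. Uniform random sampling without class stratification and the fixed seed are assumptions; the text does not specify how labeled images are drawn. `semi_supervised_split` is a hypothetical helper.

```python
import random

def semi_supervised_split(image_ids, labeled_ratio, train_frac=0.7, seed=0):
    """Return (labeled, unlabeled, validation) lists of image ids."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)           # assumed: plain random shuffle
    n_train = round(train_frac * len(ids))
    train, val = ids[:n_train], ids[n_train:]
    n_labeled = max(1, round(labeled_ratio * len(train)))
    return train[:n_labeled], train[n_labeled:], val

labeled, unlabeled, val = semi_supervised_split(range(1800), labeled_ratio=0.10)
print(len(labeled), len(unlabeled), len(val))  # 126 1134 540
```

For the 1800 NEU-DET images this gives 1260 training and 540 validation images, so the 10% setting leaves only 126 labeled images; the 1% setting leaves about a dozen, which is why pseudo-label quality dominates in that regime.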

4.2. Experimental Setup

(1) Implementation Details: TaDP-Det is implemented on the MMDetection framework, built upon a Faster R-CNN detector with a ResNet-50 backbone. We use SGD with an initial learning rate of 0.001, a weight decay of 0.0001, and gradient-norm clipping at 0.1. Training follows an iteration-based schedule of 18k iterations with step decays at 12k and 16k iterations. The teacher is updated by EMA with momentum 0.999. For TEM, we set the texture descriptor dimension to $[eqn]$ and the quantization bin number to $[eqn]$ . Patch-level expert-choice routing is applied on an approximately $[eqn]$ patch grid with $[eqn]$ experts, and the routing budget is controlled by the capacity factor c with $[eqn]$ . For CDPF, we maintain a per-class FIFO score buffer of size $[eqn]$ and fit a two-component 1D GMM by EM. Posterior thresholding uses $[eqn]$ , and we apply a minimum floor $[eqn]$ to stabilize early training. All experiments are conducted on Ubuntu 20.04 with PyTorch (1.13.1) and MMDetection (3.3.0), using two NVIDIA RTX 3090 GPUs (24 GB), an Intel Xeon W-2255 CPU (3.70 GHz), and 64 GB RAM.

(2) Evaluation Metrics: In this work, we primarily adopt the Average Precision at IoU = 0.5 (AP$_{0.5}$) and its mean over all defect categories, mAP$_{0.5}$, as the main accuracy metrics. AP$_{0.5}$ measures the area under the precision–recall curve for each class at an IoU threshold of 0.5, reflecting both classification accuracy and localization quality, while mAP$_{0.5}$ summarizes the overall detection performance across categories. To further assess the practicality of the proposed method in industrial scenarios, we additionally report FPS, Params (M), and GFLOPs. FPS is measured under the same inference setting for all methods to reflect practical throughput, while Params (M) and GFLOPs indicate the model size and computational complexity, respectively. All metrics are evaluated under the semi-supervised settings specified for each dataset.
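The IoU criterion behind the 0.5 threshold can be made concrete with a small helper. This is a generic illustration of box IoU with boxes given as (x1, y1, x2, y2) corner coordinates; the function name and the example boxes are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

pred, gt = (10, 10, 50, 50), (20, 20, 60, 60)
iou = box_iou(pred, gt)
print(round(iou, 3), iou >= 0.5)  # 0.391 False
```

Here the two 40 × 40 boxes overlap in a 30 × 30 region, giving 900 / 2300 ≈ 0.391, so this prediction would not count as a true positive at IoU = 0.5.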

4.3. Analysis of Results

We compare TaDP-Det with representative semi-supervised object detection (SSOD) baselines on three industrial defect datasets: NEU-DET [11], GC10-DET [12], and PCB-DEFECT [13] under multiple labeled-data ratios. Following a common detection pipeline taxonomy, the compared methods are grouped into three categories: (1) End-to-end methods: Semi-DETR [20], Sparse Semi-DETR [21], and STEP-DETR [22]; (2) One-stage methods: DSL [15], Unbiased Teacher v2 [14], ARSL [19], and Consistent-Teacher [18]; and (3) Two-stage methods: Soft Teacher [3], PseCo [16], and Mix Teacher [17]. For each category, we also report the performance of a fully supervised counterpart trained on the labeled subset only (marked with *): DINO-DETR [46], FCOS [38], and Faster R-CNN [36], respectively. These serve as baselines to quantify the performance gain afforded by semi-supervised learning. The results are summarized in Tables 1–4. Below, we first present overall observations, followed by detailed dataset-wise analyses.

Overall observation. Tables 1–4 show that TaDP-Det consistently achieves the highest mAP_0.5_ among the compared methods on all three industrial benchmarks under the tested labeled-data ratios. The improvements are most pronounced in extremely low-label regimes and on texture-ambiguous or structure-sensitive defect categories. This suggests that (i) the Texture Enhance Module (TEM) effectively strengthens discriminative shallow features for the teacher model, and (ii) the Class-wise Dynamic Pseudo-label Filtering (CDPF) mitigates class-dependent score bias, thereby retaining informative pseudo-labels for hard categories while suppressing noise for easy ones.

(1) NEU-DET. As shown in Table 1, TaDP-Det achieves the highest mAP_0.5_ under all labeled ratios. With only 1% labeled data, TaDP-Det reaches 53.2 mAP_0.5_, outperforming the strongest two-stage SSOD baseline, Mix Teacher (45.7), by +7.5 points and exceeding PseCo (44.0) by +9.2 points. When the labeled ratio increases to 5% and 10%, TaDP-Det maintains its lead. Notably, the margin at 5% shrinks to 0.3 points (65.6 vs. 65.3 for Mix Teacher), indicating that as supervision becomes less scarce, different teacher–student frameworks start to converge. Nevertheless, TaDP-Det remains ahead at every ratio, demonstrating that the proposed modules provide persistent gains rather than only “rescuing” the extremely low-label setting.

Hard-category gains and texture ambiguity. A class-wise inspection highlights that the main improvements in NEU-DET come from the most challenging defect types, especially Crazing (Cr). Under the 1% setting, Cr AP jumps to 25.8 with TaDP-Det, while Mix Teacher and PseCo achieve only 4.2 and 11.4, respectively. Similar advantages remain at 5% and 10% labeled ratios (e.g., 45.1 AP for Cr at a ratio of 10%). This behavior aligns with the defect characteristics: Cr patterns are typically subtle, low-contrast, and heavily entangled with background textures. By inserting TEM at the shallow C3 stage, where features are texture-dominated, the teacher is guided to disambiguate foreground texture from background streaks, producing more complete and less noisy pseudo boxes for Cr and other texture-confusable categories.

Balanced supervision via class-adaptive filtering. Beyond Cr, TaDP-Det also strengthens categories such as PS and RS in the 1% regime (e.g., PS increases from 53.5 to 66.7 over Mix Teacher at 1%), while preserving strong performance on visually salient categories (Pa, Sc). This more balanced per-class AP profile is a key advantage over global-threshold methods (e.g., Soft Teacher and PseCo), which often over-select pseudo labels from easy classes and under-select from hard classes, yielding skewed supervision. CDPF explicitly addresses this issue by learning class-wise, training-adaptive thresholds, thus preventing hard classes from being starved of pseudo-supervision and avoiding excessive false positives for easy classes. Consequently, TaDP-Det not only improves overall mAP_0.5_ but also redistributes pseudo-label quality across categories toward a more class-fair and texture-aware learning signal.

Efficiency and real-time performance. We compare TaDP-Det with representative baselines on NEU-DET (10% labeled) in terms of inference throughput (FPS), parameter size (Params), and computational cost (GFLOPs) under the same inference setting (Table 2). TaDP-Det achieves the best accuracy (71.8 mAP_0.5_) while maintaining competitive throughput (34.3 FPS). Among all SSOD baselines, Mix Teacher is the strongest in accuracy (70.6 mAP_0.5_), yet TaDP-Det improves mAP_0.5_ by +1.2 and runs 1.20× faster (34.3 vs. 28.5 FPS). TaDP-Det introduces a modest increase in Params/GFLOPs (45.94M/159 GFLOPs), mainly due to the additional experts in TEM. Importantly, TEM adopts sparse expert-choice routing, so expert computation is only applied to a subset of patches rather than the full feature map at every layer, which helps retain practical throughput. Meanwhile, CDPF is used only during training and is removed at inference, incurring no extra runtime cost. Overall, these results indicate that TaDP-Det provides consistent accuracy gains without sacrificing practical throughput, supporting its feasibility for time-sensitive industrial inspection scenarios.

Qualitative results on NEU-DET. Figure 3 presents qualitative results comparing TaDP-Det with representative semi-supervised object detection (SSOD) baselines on the NEU-DET dataset under a 10% labeled data regime. For texture-ambiguous, low-contrast defects such as the sample Crazing_4, our method generates more complete and precise predictions that align closely with the true defect boundaries. In contrast, baseline approaches (e.g., DSL, Consistent Teacher, Mix Teacher) often yield confused or fragmented detections, demonstrating a limited ability to interpret the noisy, irregular patterns characteristic of this defect class. When presented with multiple adjacent defects, as in sample Pitted_surface_172, our detector successfully distinguishes between individual instances and maintains clear inter-category boundaries. Competing methods (e.g., Consistent Teacher, Soft Teacher), however, frequently produce overlapping bounding boxes that fail to separate distinct defective regions. For fine-structured defects like Scratches_6, TaDP-Det accurately localizes the delicate, elongated structure as a unified component. Conversely, baseline models tend to generate either fragmented detections or mislocalized predictions (e.g., PseCo), failing to capture the defect in its entirety. Overall, across all defect categories, TaDP-Det demonstrates superior robustness compared with conventional global-threshold teacher–student frameworks (e.g., Soft Teacher, PseCo, Mix Teacher). It produces significantly fewer spurious detections, reduces class confusion, and successfully suppresses false positives on visually similar background textures. This performance corroborates the effectiveness of our class-adaptive pseudo-label selection mechanism in stabilizing training and improving discriminative capability across diverse categories. 
These qualitative observations provide a visual explanation for the quantitative performance gains documented in Table 1, particularly under extremely low-data conditions.

(2) GC10-DET. Table 3 shows that TaDP-Det also achieves the best performance on GC10-DET, which features larger resolution, greater intra-class variation, and more complex metallic textures. With 5% labeled data, TaDP-Det reaches 50.2 mAP_0.5_, surpassing PseCo (48.4) by +1.8 and Mix Teacher (45.9) by +4.3. With 10% labeled data, TaDP-Det further improves to 58.2 mAP_0.5_, outperforming Mix Teacher (55.7) by +2.5 and PseCo (50.9) by +7.3. The widening margin over PseCo at 10% suggests that improving pseudo-label quality (rather than simply increasing pseudo quantity) becomes increasingly important when the student already has a stronger supervised anchor.

Where the gains come from. Per-class results indicate that TaDP-Det brings particularly large improvements in categories that are easily corrupted by background texture or exhibit strong score bias. For instance, at 10% labeled data, TaDP-Det boosts waist folding (wf) to 68.7 AP (vs. 60.2 for PseCo and 49.8 for Mix Teacher) and rolled pit (rp) to 38.9 AP (vs. 20.7 for PseCo and 25.2 for Mix Teacher), which are substantial gains. These categories often manifest as elongated folds or subtle pits embedded in repetitive metallic patterns; they are prone to incomplete localization and under-confident scores early in training. TEM helps the teacher better capture low-level discriminative cues, while CDPF prevents discarding these hard-class predictions solely due to lower raw scores.

Trade-offs and net benefit. We also observe that TaDP-Det does not monotonically improve every category (e.g., some classes, such as welding line (wl), show competitive but not always the highest AP against certain baselines). This is expected in SSOD because class-wise thresholding and improved hard-class recall may slightly reduce the pseudo-label density for some easy categories. However, the overall mAP improves because the additional gains concentrate on the long-tail/hard categories that dominate the error budget under low-label supervision. In other words, TaDP-Det reallocates learning capacity from over-confident easy-class pseudo labels toward under-served hard classes, yielding a better global optimum.

(3) PCB-DEFECT. As shown in Table 4, TaDP-Det achieves the highest mAP_0.5_ on PCB-DEFECT under both ratios. With 5% labeled data, TaDP-Det obtains 89.8 mAP_0.5_, exceeding Semi-DETR (88.2) by +1.6 and surpassing Soft Teacher (83.1) and PseCo (85.9) by clear margins. With 10% labeled data, TaDP-Det further improves to 92.7 mAP_0.5_, remaining the best-performing method.

Structure-sensitive defects and shallow cues. PCB defects are often thin, small, and highly structured (e.g., breaks/bridges along traces), making them sensitive to edge continuity and local texture. TaDP-Det yields clear improvements in such categories: at 5% labeled data, Open Circuit (OC) reaches 99.1 AP and Short (SH) reaches 86.6 AP, both higher than the strong baselines; at 10%, SH further rises to 87.6 AP. These improvements corroborate the role of TEM at C3: enhancing early structural cues helps the teacher generate more complete pseudo boxes for fine-grained patterns, which is crucial when defects occupy only a tiny portion of high-resolution PCB images.

Class-dependent confidence and robust pseudo selection. PCB categories also exhibit different confidence distributions (e.g., some defects are visually salient while others are subtle and easily confused with traces), making a global pseudo threshold suboptimal. CDPF alleviates this by adapting thresholds per class and per training stage, thereby retaining informative pseudo labels for low-score classes and reducing noise from over-confident ones. Although a few categories may show comparable AP to the best baseline (and occasional small decreases in certain easy classes), the overall mAP gains indicate that TaDP-Det improves the quality–diversity of pseudo supervision and stabilizes semi-supervised training for high-resolution, multi-instance PCB inspection.

Summary across datasets. Across NEU-DET, GC10-DET, and PCB-DEFECT, the consistent pattern is that TaDP-Det improves performance primarily by (i) strengthening texture and structure cues in shallow features to reduce foreground–background confusion (TEM), and (ii) correcting class-wise score bias to obtain more reliable and balanced pseudo labels (CDPF). This combination is particularly effective for industrial defect detection, where background textures are strong, defects are subtle, and teacher confidence is inherently class-dependent.

4.4. Ablation Study

In this section, we conduct ablation studies to validate each component of TaDP-Det. Unless otherwise specified, all results are reported on NEU-DET under the semi-supervised setting with 10% labeled and 90% unlabeled data. For fair comparison, all variants are trained with the same number of iterations and identical optimization settings. We report inference-time complexity in terms of Params (M) and GFLOPs measured on a single deployed detector with an input size of $[eqn]$ . For semi-supervised variants, we use the EMA teacher for evaluation. For the 10% supervised baseline, we evaluate the trained Faster R-CNN directly.

Since inference uses only one detector, any increase in Params/GFLOPs is attributable solely to the TEM; the CDPF is applied only during training and thus introduces no inference overhead. Results are summarized in Table 5.

4.4.1. Baseline Network

Under the 10% labeled setting, training Faster R-CNN with labeled images only yields 56.7% mAP on NEU-DET (Table 5), indicating that limited annotations are insufficient to learn robust defect representations in industrial surface scenarios. We then adopt a standard teacher–student SSOD framework with weak–strong augmentation, where the teacher generates pseudo labels for unlabeled images to supervise the student, and the teacher is updated as an EMA of the student. By leveraging unlabeled data, the mAP is boosted to 65.1%, which serves as our SSOD baseline for subsequent comparisons.
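As a minimal illustration of the teacher update in this framework, the sketch below applies an exponential moving average with the momentum of 0.999 stated in the implementation details. Plain dictionaries of floats stand in for model parameters here; this is an assumption made for readability, and `ema_update` is a hypothetical helper rather than the authors' code.

```python
def ema_update(teacher, student, momentum=0.999):
    """Update teacher weights in place as an EMA of the student weights.

    teacher, student: dicts mapping parameter names to float values
    (a stand-in for real model tensors).
    """
    for name, w_student in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * w_student
    return teacher


# Usage: the teacher drifts slowly toward the student after each iteration.
teacher = {"conv1.weight": 1.0}
student = {"conv1.weight": 0.0}
for _ in range(3):
    ema_update(teacher, student)
```

With a momentum this close to 1, a single student step moves the teacher by only 0.1% of the gap, which is why the teacher's pseudo labels change smoothly over training.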

4.4.2. Effect of TEM

(1) Quantitative improvement. We introduce the proposed Texture Enhance Module (TEM) at the shallow C3 stage for both teacher and student to alleviate foreground–background confusion under texture-ambiguous and low-contrast backgrounds. TEM strengthens low-level texture cues via a texture-aware patch-level mixture-of-experts, thereby improving pseudo-label reliability in low-contrast, texture-ambiguous areas. As reported in Table 5, TEM improves the SSOD baseline from 65.1% to 68.5%. In terms of inference complexity, adding TEM to the deployed detector increases Params/GFLOPs from 41.374M/131 GFLOPs to 45.944M/159 GFLOPs under $[eqn]$ input.

(2) Qualitative evidence. Figure 4 shows C3 activation maps on three representative NEU-DET samples. For the scratch case (scratches_6, row 1), the baseline responses are scattered and partially activated by background textures, whereas TEM produces a more continuous activation aligned with the elongated scratch structure. For the inclusion example (inclusion_81, row 2), TEM strengthens the weak defect cues in low-contrast areas, making the defect region more salient and spatially coherent. For the crazing sample (crazing_4, row 3), TEM suppresses scattered texture-induced responses and yields a more structured activation pattern, indicating improved robustness under fine-grained texture variations. Overall, these visual patterns indicate that TEM mitigates shallow-level foreground–background confusion, consistent with the quantitative gain in Table 5.

(3) Position ablation. To determine the best insertion point for TEM in the backbone, we evaluate three variants by inserting TEM at the C2, C3, and C4 stages, while keeping the SSOD training protocol and all hyperparameters unchanged. All results are reported as mAP_0.5_ on NEU-DET with 10% labeled data. As shown in Table 6, inserting TEM at the C3 stage achieves the best performance, reaching 68.5 mAP_0.5_ (+3.4 over the SSOD baseline of 65.1). In comparison, the C2 stage yields 67.3 (+2.2), and the C4 stage yields 66.2 (+1.1). This trend aligns with TEM’s design: the texture-aware descriptor estimates local-to-global texture consistency and guides patch-wise expert refinement, which requires both sufficient local detail and reasonably stable semantic context. At the C2 stage, responses are dominated by very early cues and are more susceptible to noise and background clutter, making the estimated consistency signal less stable and the routing less reliable. At the C4 stage, features are more invariant but spatially coarser due to down-sampling, so subtle texture inconsistencies are partly smoothed out, leaving limited room for patch-level expert refinement. In contrast, the C3 stage provides a better balance between spatial resolution and stable texture representation, resulting in the most reliable descriptor and the largest gain.

(4) Effect of different routing strategies. Keeping the expert block $[eqn]$ , the texture-aware descriptor $[eqn]$ , and all training settings fixed, we vary only the routing scheme to explore the effect of different routing strategies in TEM. We compare (i) soft routing [47], which evaluates all experts for each patch and fuses their outputs with gating weights; (ii) token-choice routing [48], which assigns each patch to its top-1 expert; and (iii) our expert-choice routing, where each expert selects patches under a capacity budget $[eqn]$ . As shown in Table 7, all three routing schemes improve the SSOD baseline of 65.1 mAP_0.5_, confirming the effectiveness of patch-level expert refinement. Token-choice yields 67.2 mAP_0.5_ with the lowest cost (142 GFLOPs), whereas soft routing achieves 67.9 mAP_0.5_ but is the most expensive (175 GFLOPs) since all experts are evaluated. In contrast, our expert-choice routing reaches the best accuracy of 68.5 mAP_0.5_ with a moderate cost (159 GFLOPs), indicating that the proposed capacity-aware expert selection provides a better accuracy–efficiency trade-off than both dense fusion and per-token hard assignment.

(5) Comparison with various texture refinement modules. To verify that TEM’s gain is not merely due to adding an extra refinement block at the shallow stage, we compare TEM with three widely used texture refinement modules on the same C3 feature map: CBAM [24], CoordAtt [30], and FcaNet [31]. For fairness, each module is inserted at the C3 stage for both teacher and student while keeping the SSOD training/inference protocol unchanged, and we report mAP_0.5_ on NEU-DET (10% labeled). CBAM sequentially applies channel and spatial attention to enhance generic saliency, which is particularly helpful for relatively more “region-salient” defects, and this is consistent with its strongest per-class gain on the Inclusion class in Table 8. CoordAtt embeds positional information into channel attention via coordinate encoding, favoring elongated structures with strong directional continuity; correspondingly, it performs best on the Scratches class. FcaNet enriches channel descriptors with multi-frequency (DCT) components, making it more sensitive to fine-grained texture statistics, which aligns with its advantage on the pitted-surface class. As shown in Table 8, these general modules yield consistent but limited gains over the SSOD baseline, suggesting that uniform dense recalibration over the entire feature map is helpful but insufficient for texture-ambiguous defects. In contrast, our TEM reaches 68.5 mAP_0.5_ with a modest overhead, indicating that the key benefit comes from descriptor-guided, patch-wise expert refinement. Specifically, the texture descriptor highlights local deviations from the dominant background texture, and the routing mechanism focuses stronger transformations on the most ambiguous patches, thereby reducing shallow foreground–background confusion more effectively than general attention-style refinement.
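The difference between token-choice and expert-choice routing discussed in item (4) can be sketched as follows. This is a simplified illustration, not the TEM implementation: the affinity matrix is made up, and the per-expert budget `k = ceil(c * N / E)` follows the usual expert-choice formulation, which is an assumption because the paper's own budget expression is not reproduced in this section.

```python
import math


def token_choice_route(scores):
    """Each patch picks its single best expert (top-1 assignment).

    scores[p][e] is the affinity of patch p to expert e.
    Returns one expert index per patch.
    """
    return [max(range(len(row)), key=lambda e: row[e]) for row in scores]


def expert_choice_route(scores, capacity_factor):
    """Each expert picks its top-k patches under a capacity budget.

    Assumed budget: k = ceil(capacity_factor * n_patches / n_experts).
    Returns, for each expert, the list of selected patch indices.
    """
    n_patches = len(scores)
    n_experts = len(scores[0])
    k = min(n_patches, math.ceil(capacity_factor * n_patches / n_experts))
    assignment = []
    for e in range(n_experts):
        ranked = sorted(range(n_patches), key=lambda p: scores[p][e], reverse=True)
        assignment.append(ranked[:k])
    return assignment


# Hypothetical affinities for 4 patches and 2 experts.
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.4, 0.6],
]
per_patch = token_choice_route(scores)
per_expert = expert_choice_route(scores, capacity_factor=1.0)
```

In expert-choice routing the load per expert is fixed by construction, while a patch may be processed by several experts or by none, so computation concentrates on the patches that experts rank highest.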

4.4.3. Effect of CDPF

(1) Quantitative improvement. We further apply the proposed Class-wise Dynamic Pseudo-label Filtering (CDPF) to tackle class-dependent confidence bias in pseudo labeling, where a fixed global threshold tends to over-select easy classes while discarding hard-class positives. Specifically, CDPF fits a lightweight 1D two-component GMM to the per-class pseudo-label score distribution. It then derives a class-specific threshold $[eqn]$ from the posterior decision boundary. As shown in Table 5, CDPF on top of TEM further improves the mAP from 68.5% to 71.8%. Since CDPF is applied only during training, it introduces no additional inference overhead.

(2) Qualitative evidence. Figure 5 visualizes the per-class teacher confidence distributions and the fitted two-component GMMs. For an easy class (left), scores concentrate around a high-confidence mode, and both the global threshold and the class-wise threshold lie in the high-score region, retaining most positives. For a hard class (right), the high-score component shifts left and overlaps substantially with the low-score component; a fixed global threshold would discard many medium-confidence yet potentially correct positives. By selecting the posterior decision boundary between the two components, CDPF sets a lower class-adaptive threshold, preserving hard-class positives while still filtering low-score noise.

(3) Comparison with non-GMM filtering strategies. We further compare CDPF with several commonly used non-GMM filtering strategies to demonstrate the effectiveness of our class-wise GMM-based thresholding. Specifically, we consider (i) a fixed global confidence threshold $[eqn]$ (following the Soft Teacher [3] setting as $[eqn]$ ), (ii) top-q filtering that retains the top $[eqn]$ pseudo labels ranked by teacher confidence, and (iii) the mean + std. rule that sets an adaptive threshold using the confidence statistics, i.e., $[eqn]$ . Note that all compared strategies are applied only during training and do not modify the deployed detector, thus introducing no additional inference-time overhead. As shown in Table 9, the fixed-threshold baseline achieves 68.5 mAP_0.5_. Top-q filtering yields a modest improvement to 69.3 mAP_0.5_ by keeping more medium-confidence pseudo labels, while Mean+Std further improves to 70.7 mAP_0.5_ due to its distribution-adaptive thresholding. In contrast, our proposed CDPF achieves the best performance of 71.8 mAP_0.5_ by modeling per-class teacher score distributions and deriving a class-specific threshold $[eqn]$ from the posterior decision boundary, which better mitigates class-dependent confidence bias while suppressing low-score noise.
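The class-wise thresholding idea can be sketched as below: fit a two-component 1D GMM to one class's scores with EM, then take the lowest score at which the posterior of the high-score component reaches a level `tau`, clipped from below by a floor. This is a hedged illustration under several assumptions: the initialization, iteration count, variance floor, grid search, and fallback rule are choices made here, the values of `tau` and `floor` are placeholders, and the function names are hypothetical rather than the authors' implementation.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm2(scores, n_iter=50, min_var=1e-3):
    """Fit a two-component 1D GMM by EM. Returns (weights, means, variances)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    s = sorted(scores)
    half = len(s) // 2
    # Initialize one component on the lower half, one on the upper half.
    mu = [sum(s[:half]) / half, sum(s[half:]) / (len(s) - half)]
    var = [0.01, 0.01]
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each score.
        resp = []
        for x in s:
            w = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = (w[0] + w[1]) or 1e-300
            resp.append((w[0] / total, w[1] / total))
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            mu[k] = sum(r[k] * x for r, x in zip(resp, s)) / nk
            var[k] = max(min_var, sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, s)) / nk)
            pi[k] = nk / len(s)
    return pi, mu, var


def posterior_threshold(pi, mu, var, tau, floor, grid=1000):
    """Lowest score in [0, 1] where P(high component | score) >= tau."""
    hi = 0 if mu[0] > mu[1] else 1
    lo = 1 - hi
    for i in range(grid + 1):
        x = i / grid
        if x < mu[lo]:
            continue  # ignore the far-left tail of the low-score component
        p_hi = pi[hi] * normal_pdf(x, mu[hi], var[hi])
        p_lo = pi[lo] * normal_pdf(x, mu[lo], var[lo])
        if p_hi + p_lo > 0 and p_hi / (p_hi + p_lo) >= tau:
            return max(floor, x)
    # Fallback (an assumption): use the high-component mean.
    return max(floor, mu[hi])


# Synthetic scores for one class: a low-confidence and a high-confidence mode.
scores = [0.15, 0.18, 0.20, 0.22, 0.25] * 4 + [0.75, 0.78, 0.80, 0.82, 0.85] * 4
weights, means, variances = fit_gmm2(scores)
threshold = posterior_threshold(weights, means, variances, tau=0.5, floor=0.0)
```

For a hard class whose high-score mode sits lower, the same procedure yields a lower threshold than a fixed global one, which is the behavior described for Figure 5.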

4.4.4. Hyperparameter Sensitivity Analysis

For completeness and reproducibility, we conduct a one-factor-at-a-time sensitivity analysis of key hyperparameters in TEM and CDPF on NEU-DET under the 10% labeled semi-supervised setting. Unless otherwise stated, we vary one hyperparameter while fixing the others to the default configuration ( $[eqn]$ for TEM; $[eqn]$ for CDPF).

(1) Analysis of TEM Hyperparameter Sensitivity. Table 10 reports the sensitivity of TaDP-Det to the number of quantization bins K, patch size P, capacity factor c, and expert count E under the one-factor-at-a-time protocol. Overall, performance varies moderately within the tested ranges, indicating that TEM is not overly sensitive to these hyperparameters and that the default configuration provides a practical accuracy–efficiency trade-off. Specifically, the number of quantization bins K controls the resolution of texture-aware discretization: a small K may under-represent subtle texture variations, while a larger K offers limited additional benefit and can lead to sparser assignments. Patch size P determines the granularity of routing (token grid $[eqn]$ ): a smaller P enables finer-grained routing but is more susceptible to local noise, whereas a larger P yields coarser tokens that may mix defect and background clutter. Capacity factor c adjusts the routed computation budget: increasing c from 1.0 to 2.0 brings a clear gain, while further increasing it to 3.0 yields only a marginal improvement at additional computational cost; thus, we set $[eqn]$ by default. Similarly, increasing the expert count E can improve performance by encouraging specialization. However, the modest improvement from $[eqn]$ to $[eqn]$ (Table 10) involves extra computation; thus, $[eqn]$ is used in our method by default.

(2) Analysis of CDPF Hyperparameter Sensitivity. Table 11 further analyzes CDPF hyperparameters. Buffer size $[eqn]$ controls how much recent score history is used for online per-class score modeling: a smaller $[eqn]$ can produce noisier estimates, whereas an excessively large $[eqn]$ may incorporate stale history and reduce responsiveness as training progresses. Therefore, we set $[eqn]$ by default, as it provides sufficiently smooth per-class score modeling while remaining responsive to distribution shifts during training. Posterior level $[eqn]$ in Equation (21) defines the decision boundary on the fitted mixture distribution and governs the precision–recall trade-off: a smaller $[eqn]$ admits more pseudo-labels (higher recall but potentially more false positives), while a larger $[eqn]$ is more conservative (higher precision with fewer pseudo-labels). Thus, we use $[eqn]$ by default as a balanced posterior criterion that avoids both overly aggressive pseudo-label admission and overly conservative filtering, leading to stable pseudo-label quality throughout training. Safeguard floor $[eqn]$ in Equation (23) prevents overly low or unstable class-wise thresholds, especially in early training or for rare classes: a smaller $[eqn]$ is more permissive but may retain unreliable pseudo-labels, whereas a larger $[eqn]$ is more conservative and may reduce pseudo-label coverage for hard classes. Accordingly, $[eqn]$ is adopted by default as an effective safeguard against unreliable low-confidence pseudo-labels without excessively shrinking pseudo-label coverage. Overall, CDPF remains stable within the tested ranges, and the default setting offers a consistent balance.
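The roles of the buffer size and the safeguard floor can be illustrated with a small sketch. It assumes a per-class `collections.deque` with a fixed `maxlen` as the FIFO buffer and a filter that clips every class-wise threshold from below by the floor; the class name, function name, and all numeric values are hypothetical and do not reproduce the paper's defaults.

```python
from collections import defaultdict, deque


class ClassScoreBuffer:
    """Per-class FIFO buffer holding the most recent teacher scores."""

    def __init__(self, size):
        self.size = size
        self._buf = defaultdict(lambda: deque(maxlen=size))

    def push(self, class_id, scores):
        # Oldest scores are dropped automatically once the buffer is full.
        self._buf[class_id].extend(scores)

    def scores(self, class_id):
        return list(self._buf[class_id])

    def is_full(self, class_id):
        return len(self._buf[class_id]) == self.size


def keep_pseudo_labels(detections, thresholds, floor):
    """Keep (class_id, score) pairs whose score passes the class threshold.

    A class without a fitted threshold yet falls back to the floor, and no
    class threshold is allowed to drop below the floor.
    """
    kept = []
    for class_id, score in detections:
        thr = max(floor, thresholds.get(class_id, floor))
        if score >= thr:
            kept.append((class_id, score))
    return kept


buf = ClassScoreBuffer(size=3)
buf.push(0, [0.1, 0.2, 0.3, 0.4, 0.5])
recent = buf.scores(0)

dets = [(0, 0.9), (0, 0.4), (1, 0.35), (1, 0.2)]
kept = keep_pseudo_labels(dets, thresholds={0: 0.7, 1: 0.1}, floor=0.3)
```

A larger buffer smooths the fitted distribution but keeps older scores longer, which matches the stale-history trade-off described above.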

4.5. Analysis of Illumination Effects

In real industrial inspection, illumination conditions may fluctuate due to exposure variations, reflective surfaces, or non-uniform lighting. Such changes can cause under-exposed and low-contrast images, weakening defect visibility and potentially affecting feature activation and detection confidence. To provide qualitative evidence of robustness under illumination variations, we select representative samples from NEU-DET that include both relatively well-illuminated and under-exposed cases for the Inclusion and Patches categories and visualize the ground truth, intermediate feature response, and final predictions in Figure 6.

For Inclusion_229 (well-illuminated), the inclusion defects are clearly distinguishable from the background, and the feature response exhibits a strong and compact activation aligned with the defect region. The detector accurately localizes the inclusion with high confidence (In:96.8). In contrast, Inclusion_39 (under-exposed) shows much weaker contrast and ambiguous boundaries in the raw image. Although the overall activation becomes less distinct and more sensitive to local intensity variations, the intermediate response map still highlights defect-related elongated structures, enabling TaDP-Det to localize the main inclusion regions with relatively high confidence (e.g., In:85.2, In:88.1, and In:94.6). We also observe an additional low-confidence prediction (In:54.3), indicating that severe low illumination may introduce local ambiguities and occasionally trigger spurious responses. Nonetheless, the principal defect instances remain consistently activated and correctly detected. A similar trend is observed for the Patches category. In Patches_270 (better illumination), the response map concentrates strongly on the patch regions, and the detector produces confident and accurate localizations (Pa:99.8 and Pa:99.6). For Patches_219 (under-exposed), the defect lies in a darker area with stronger background texture and intensity variation. Compared with the brighter case, the feature activation becomes weaker and less separable from the textured background, and the detection confidence decreases (e.g., Pa:92.4), while the predicted box still aligns well with the ground truth.

Overall, these results suggest that illumination variations can affect the intermediate feature response and may reduce prediction confidence in under-exposed images. Nevertheless, TaDP-Det remains largely robust in terms of localization accuracy, producing predictions that are consistent with the ground truth under the illumination conditions examined.

4.6. Estimated Annotation Savings and Practical Value

A key practical motivation for semi-supervised defect detection is to reduce the high cost of manual annotation in industrial inspection. This challenge is widely acknowledged in industrial vision: acquiring data samples and reliable labels is difficult and costly in real-world deployments [49]. In many scenarios, annotations must be produced by trained experts, where each label may involve non-trivial judgment developed through years of domain experience. Although outsourcing is often adopted to scale up annotation, it may introduce quality issues such as incomplete or noisy labels, thereby increasing the burden of auditing, quality control, and even re-annotation. Such imperfect supervision can further degrade model performance and hinder robust deployment in industrial settings.

Under the standard SSOD protocols adopted in this paper, labeling only 1%/5%/10% of the training set corresponds to an approximate 99%/95%/90% reduction in manually annotated training images compared with fully supervised learning. For instance, with a 7:3 train/val split, NEU-DET provides about 1260 training images. Thus, the 10% protocol requires labeling only 126 images (and 63/13 images for 5%/1%, respectively), while the remaining images can be exploited as unlabeled data for self-training. In real production lines, inspection datasets often scale to hundreds of thousands or even millions of images. At such scales, even a 1% labeling regime translates into substantial savings and enables practitioners to concentrate expert effort on a small labeled core set and targeted auditing and quality control. Overall, TaDP-Det directly supports this objective by improving detection accuracy in low-label regimes, making industrial deployment more annotation-efficient.
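The labeling budgets quoted above follow from simple arithmetic, sketched here for the roughly 1260 NEU-DET training images. The helper name `labeled_budget` is illustrative, and rounding to the nearest image is an assumption that reproduces the 13/63/126 counts in the text.

```python
def labeled_budget(n_train, percent):
    """Return (number of labeled images, fraction of annotation saved)."""
    n_labeled = round(n_train * percent / 100)
    saving = 1.0 - n_labeled / n_train
    return n_labeled, saving


n_train = 1260  # about 70% of NEU-DET under a 7:3 train/val split
budgets = {pct: labeled_budget(n_train, pct) for pct in (1, 5, 10)}
for pct, (n_labeled, saving) in budgets.items():
    print(f"{pct}% protocol: {n_labeled} labeled images, {saving:.1%} fewer annotations")
```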

5. Conclusions

In this study, we address the problem of semi-supervised industrial surface defect detection, where the scarcity of reliable annotations and the presence of complex, low-contrast textures present inherent challenges to pseudo-label quality. We identify two principal factors that degrade pseudo-label reliability in this setting: shallow feature maps that suffer from foreground–background confusion and class-dependent detection difficulty that renders global confidence thresholds suboptimal.

To overcome these limitations, we propose TaDP-Det, a semi-supervised detector designed to improve pseudo-label quality from both feature and score perspectives. On the feature side, we introduce a Texture Enhance Module (TEM), implemented as a texture-aware patch-level mixture-of-experts at a shallow backbone stage, to explicitly reinforce low-level texture cues and allocate computational resources to texture-ambiguous regions. On the score side, we devise a class-wise dynamic pseudo-label filtering (CDPF) scheme based on lightweight one-dimensional Gaussian mixture models, which adaptively derives category-specific confidence thresholds from per-class score distributions and better retains intrinsically challenging defects.

Comprehensive experiments on three representative industrial datasets (NEU-DET, GC10-DET, and PCB-DEFECT) demonstrate that TaDP-Det consistently outperforms strong semi-supervised object detection baselines in terms of mAP, while incurring only modest computational overhead. These results validate that jointly enhancing texture-aware feature representation and performing class-adaptive pseudo-label filtering constitutes an effective and practical approach for semi-supervised industrial surface defect detection.

Bibliography


1. Luo Q., Fang X., Su J., Zhou J., Zhou B., Yang C., Liu L., Gui W., Tian L. Automated visual defect classification for flat steel surface: A survey. IEEE Trans. Instrum. Meas. 2020, 69, 9329–9349. doi:10.1109/TIM.2020.3030167
2. Luo Q., Fang X., Liu L., Yang C., Sun Y. Automated visual defect detection for flat steel surface: A survey. IEEE Trans. Instrum. Meas. 2020, 69, 626–644. doi:10.1109/TIM.2019.2963555
3. Xu M., Zhang Z., Hu H., Wang J., Wang L., Wei F., Bai X., Liu Z. End-to-end semi-supervised object detection with soft teacher. Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3060–3069.
4. Liu Y.C., Ma C.Y., He Z., Kuo C.W., Chen K., Zhang P., Wu B., Kira Z., Vajda P. Unbiased Teacher for Semi-Supervised Object Detection. arXiv 2021, arXiv:2102.09480. doi:10.48550/arXiv.2102.09480
5. Zhang J., Su H., Zou W., Gong X., Zhang Z., Shen F. CADN: A weakly supervised learning-based category-aware object detection network for surface defect detection. Pattern Recognit. 2021, 109, 107571. doi:10.1016/j.patcog.2020.107571
6. Zhou M., Su Z., Li M., Wang Y., Li G. CSDD-Net: A cross semi-supervised dual-feature distillation network for industrial defect detection. Knowl. Based Syst. 2024, 306, 112751. doi:10.1016/j.knosys.2024.112751
7. Xiao H., Zhao C., Zhang Z. A semi-supervised method for steel surface defect detection based on soft-teacher. Acad. J. Comput. Inf. Sci. 2023, 6, 11–19. doi:10.25236/ajcis.2023.060302
8. Ge J., Qin Q., Song S., Jiang J., Shen Z. Unsupervised selective labeling for semi-supervised industrial defect detection. J. King Saud Univ. Comput. Inf. Sci. 2024, 36, 102179. doi:10.1016/j.jksuci.2024.102179