SCFM-DETR: an enhanced transformer-based method for automated maize disease detection in field environments

Sasa Tian; Zhiqing Tao; Ke Li; Yuan Rao; Xianhong Xie; Yuan Yuan; Jun Zhu

PMC · DOI: 10.1186/s13007-026-01507-8 · February 16, 2026

Open Access

TL;DR

A new lightweight model called SCFM-DETR improves maize disease detection in field conditions with high accuracy and reduced computational needs.

Contribution

The novel SCFM-DETR model introduces an improved backbone and adaptive feature fusion module for efficient and accurate maize disease detection.

Findings

01

SCFM-DETR achieves 96.7% average precision and 95.8% recall for maize disease detection.

02

The model reduces parameters and computational load by 47% and 49%, respectively, compared to the baseline.

03

The model is suitable for deployment in computationally limited agricultural environments.

Abstract

Maize is susceptible to various diseases throughout its growth cycle, which can significantly reduce yields. The accurate identification of maize diseases with similar symptomatic manifestations is particularly challenging under field conditions due to heterogeneous lighting and variable weather conditions. This paper proposes a novel detection model named SCFM-DETR, which is based on an improved Real-Time DEtection TRansformer (RT-DETR) to achieve robust identification of maize diseases in complex environments. SimAM-StarNet is employed as the backbone for feature extraction in this model, reducing the number of parameters and improving multiscale feature fusion, thereby diminishing the impact of background noise. Furthermore, the original RepC3 module is replaced with a newly designed CGLU-FasterBlock-MANet (CFM) module, which enhances adaptive feature fusion for finer discriminative…

Figures (12)

Comparison of the SCFM-DETR with mainstream models in terms of the precision (a), recall (b), and mAP (c)

Detection results of the SCFM-DETR and other models

Recognition results of the SCFM-DETR and other models

Maize disease image dataset. a Illustration of the dataset categories. b, c Examples from the dataset representing different scenarios

Schematic diagram of the SCFM-DETR structure

Schematic diagram of the SimAM-StarNet structure

Comparison of the attention mechanisms: channel, spatial, and SimAM

Comparison of the validation loss (a) and mAP (b) curves between RT-DETR-R18 and the SCFM-DETR during training

Comparison of the precision (a), recall (b), and mAP (c) across categories for the ablation experiments in each group

SCFM-DETR ablation experiment detection results

Funding (4)

The study was sponsored by:
- The Special Fund for Anhui Agriculture Research System
- The National Natural Science Foundation of China
- The Key Research and Development Plan of Anhui Province
- Natural Science Research Pro Committee

Keywords

Maize disease detection; RT-DETR; StarNet; Object detection

Taxonomy

Topics: Smart Agriculture and AI · Advanced Data and IoT Technologies · Plant Disease Management Techniques

Full text

Introduction

Maize ranks among the most widely cultivated crops globally, leading in terms of both production area and yield. According to the latest FAO report, worldwide cereal production in 2024 will reach 2841 million tons, with maize accounting for 1271 million tons—approximately 45% of the total output [1]. In addition to its role as a staple food, maize is a crucial component in animal feed and industrial products such as biofuels and processed foods [2]. Consequently, the stable production of maize is vital to global food security and agricultural economic stability [3]. However, maize is vulnerable to multiple diseases during growth, with annual yield losses estimated to be between 15 and 30%. Effective and early disease identification is therefore essential to mitigate these losses.

Common maize diseases include leaf blight, grey leaf spot, small spot, and rust [4]. Rust lesions, in particular, are often small and scattered, making them easy to overlook. Early symptoms of leaf blight and grey leaf spot are highly similar, as are those of rust and small spot, increasing the risk of misdiagnosis—especially in coinfection scenarios. Traditional disease identification relies on expert visual inspection of the lesion size, growth pattern, and infection site. This approach is not only labour-intensive and time-consuming but also subjective and experience-dependent, resulting in significant limitations in accuracy [5]. DNA-based chemical identification methods are accurate, but they are expensive and require considerable time, making them unsuitable for addressing the demands of modern precision agriculture [6]. In particular, in large-scale cultivation, automated recognition technologies with practical deployment capability are urgently needed.

The growing trend of intelligent transformation in agricultural production has promoted the successful application of computer vision technology for the intelligent identification of crop diseases in agricultural environments. Object detection techniques have driven a paradigm shift in computer vision, creating new opportunities for the automated recognition of crop diseases [7]. By enabling pixel-level target localization, these methods overcome a key limitation of traditional deep learning classification models, which can predict categories but fail to provide spatial information. Dong et al. [8] proposed a YOLOv5s-C3CBAM model for recognizing complex maize diseases, reporting improvements in both the mean average precision and recall compared with the baseline model. Meng et al. [9] developed YOLO-MSM, an algorithm that incorporates multiscale deformable kernel convolutions for corn leaf disease recognition. The use of this Multiscale Kernel Convolution (MKConv) significantly enhanced the overall network performance, although the model exhibited a relatively high false detection rate under complex field conditions. To improve the detection of small lesions on apple leaves, Hou et al. [10] introduced an FPN-ISResNet-Faster R-CNN model that integrates a feature pyramid network (FPN) and an Inception-SE block (ISResNet) to provide better feature representations for small targets. Compared with those of the conventional Faster R-CNN, the AP50 and AP75 of this method were increased by 11.2% and 8.7%, respectively, mitigating its tendency to miss small lesions. However, the model incurred high computational costs and limited real-time performance, hindering its deployment on Unmanned Aerial Vehicles (UAVs) and mobile platforms. To address the challenge of dense maize disease recognition, Li et al. [11] proposed the GhostNet_Triplet_YOLOv8s algorithm, which balanced the recognition speed and accuracy within a compact 11.2 MB model. 
Nonetheless, constrained by the local receptive fields of convolutional architectures, the model demonstrated a limited capacity to capture long-range dependencies, resulting in reduced accuracy under occlusion or high-density distributions. In summary, recent works have established YOLO as a pivotal tool in object detection, with these models demonstrating strong performance in experimental settings. However, the main focus of the YOLO architecture is the inference speed, which often compromises accuracy, particularly for small objects and targets embedded in complex backgrounds. Furthermore, while Non-Maximum Suppression (NMS) helps reduce duplicate detections, it introduces additional inference latency and hyperparameters that can impair the detection stability, ultimately limiting its practical performance in real farmland environments.

The introduction of the DEtection TRansformer (DETR) in 2020 offered an end-to-end detection paradigm without hand-designed components such as NMS [12]. However, despite the improved inference speed, the DETR suffers from a high computational cost and slow convergence. Subsequent research produced efficient variants, including the Deformable-DETR [13], Conditional-DETR [14], DN-DETR [15], DAB-DETR [16], and RT-DETR [17]. The Real-Time DETR (RT-DETR) notably reduces the computational overhead while retaining high accuracy, making it suitable for real-time applications [18]. This model has been successfully applied in agricultural contexts [19–21], although its use in crop disease detection remains underexplored.

To address recognition challenges under complex field conditions such as fluctuating illumination, the deformable attention mechanism in the RT-DETR dynamically predicts the sampling point offsets, enabling the model to focus more effectively on the morphological features of lesions. This advantage enhances its ability to accurately localize diseases in demanding environments. To overcome the difficulties posed by irregular lesion distributions and high interclass similarity among maize diseases, the hybrid encoder, which includes the Attention-based Intra-scale Feature Interaction (AIFI) module, facilitates cross-scale feature interaction through token compression. This approach more effectively captures lesions of varying sizes on maize plants, from small rust spots to large blight areas. Whereas conventional CNNs rely on local convolutional operations and are limited in global contextual modelling, the self-attention mechanism in the RT-DETR captures long-range dependencies, thereby improving the classification accuracy for visually similar disease categories through relational reasoning between lesions. Moreover, its interpretable attention mechanism can increase the fine-grained discrimination of similar lesions, supporting the development of a robust and lightweight in-field disease identification system. Due to its end-to-end architecture, multiscale attention mechanism, and real-time inference performance, the RT-DETR serves as a strong baseline that balances accuracy and efficiency. The main contributions of this paper are summarized as follows:

Construction of a maize disease dataset comprising 9327 images across eight categories—leaf blight, grey leaf spot, common rust, downy mildew, ear rot, small spot, brown spot, and healthy leaves—under real field conditions.
To effectively reduce the model complexity and suppress the noise interference experienced under realistic field conditions, SimAM-StarNet was adopted as the backbone feature extraction network in the SCFM-DETR. By adaptively emphasizing the salient features related to maize diseases, the number of parameters can be significantly decreased while enhancing multiscale feature fusion and improving the accuracy of feature extraction.
To increase the discernibility of disease targets against complex backgrounds and to improve the classification of morphologically similar maize diseases, we propose the CGLU-FasterBlock-MANet (CFM) module as an integral component of the SCFM-DETR encoder. The CFM module effectively enriches fine-grained feature representation without compromising computational efficiency, leading to a substantial improvement in overall recognition accuracy.
The results of extensive experiments with our maize disease dataset collected under complex farmland conditions demonstrate that the SCFM-DETR model achieves superior performance across key metrics, including the precision, recall, and GFLOPs, offering an optimal balance between accuracy and computational efficiency.

Materials and methods

Dataset source and category composition

The image dataset used in this study was compiled from two sources. The first subset was collected through manual acquisition in natural farmland environments at the National High-Tech Agricultural Park of Anhui Agricultural University, Hefei, China. To accurately represent the variability of real-world field conditions, images were captured in diverse scenarios, including low-light evening conditions, artificial lighting, and sunny and cloudy weather, as well as challenging situations such as leaf occlusion, small-scale lesions, and instances of multiple co-occurring diseases. This subset includes eight categories of maize conditions: leaf blight, grey leaf spot, common rust, downy mildew, ear rot, small spot, brown spot, and healthy leaves. The second subset was obtained from the publicly available IDADP maize disease dataset provided by the Science Data Bank [22]. After a rigorous curation process across both sources, 3109 images that met the quality standards were selected for use in this research (2759 captured manually and 350 from IDADP). The distribution of the dataset is detailed in Table 1, and representative samples are visually presented in Fig. 1.

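Table 1. Dataset category distribution

| Classes | Manual filming | IDADP | All |
|---|---|---|---|
| Blight | 347 | 50 | 397 |
| Grey leaf spot | 409 | 0 | 409 |
| Common rust | 306 | 100 | 406 |
| Downy mildew | 391 | 0 | 391 |
| Ear rot disease | 267 | 100 | 367 |
| Small spot | 308 | 50 | 358 |
| Brown spot | 381 | 0 | 381 |
| Healthy | 350 | 50 | 400 |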

Fig. 1. Maize disease image dataset. a Illustration of the dataset categories. b, c Examples from the dataset representing different scenarios

Dataset preprocessing

All of the images were resized to 640 × 640 pixels and annotated with LabelImg in the TXT format. Data augmentation techniques (random rotation, cropping, scaling, brightness adjustment, blur, and noise addition) were applied to improve the model generalizability and robustness. The augmented dataset contains 9327 images, which were split into training (7461), validation (933), and testing (933) sets at an 8:1:1 ratio. Unless otherwise stated, all experiments in this paper were conducted on this dataset.
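As a rough illustration of the 8:1:1 partition described above, the following sketch shuffles a list of image identifiers and splits it into three subsets. The function name `split_dataset`, the fixed seed, and the `img_00000`-style identifiers are hypothetical and are not taken from the paper; the rounding rule (validation and test sizes rounded, remainder assigned to training) is likewise an assumption that happens to reproduce the reported counts.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into train/val/test subsets.

    Validation and test sizes are rounded from the ratios; whatever
    remains goes to training (an assumed convention, not the paper's).
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_val = round(n * ratios[1])
    n_test = round(n * ratios[2])
    n_train = n - n_val - n_test
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers standing in for the 9327 augmented images.
image_ids = [f"img_{i:05d}" for i in range(9327)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))  # 7461 933 933
```

Fixing the seed keeps the partition reproducible across runs, which matters when several models are compared on the same test set.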

Overall architecture of the SCFM-DETR model

The RT-DETR architecture, which was proposed by Baidu in 2023, consists of three main components: a backbone network, an encoder, and a decoder [23]. Compared with YOLO-based models, the RT-DETR leverages transformer-based attention to capture the global context, improving the detection accuracy under occlusion or dense distributions. Unlike two-stage detectors such as Faster R-CNN, the RT-DETR eliminates the region proposal network and NMS, streamlining the pipeline and increasing efficiency. However, its accuracy remains inadequate under complex field conditions due to background clutter, scattered lesions, and interclass similarity. Additionally, its computational demand is prohibitive for resource-limited devices.

To address the limitations of the RT-DETR for crop disease recognition in complex farmland environments, we propose the SCFM-DETR model, an improved architecture that is detailed in Fig. 2. The high computational demand of the RT-DETR’s original feature extraction backbone limits its applicability in resource-limited agricultural settings. Therefore, we substituted it with the more efficient StarNet backbone to achieve a better balance between accuracy and computational efficiency. Furthermore, to increase the robustness against complex maize field backgrounds, we integrated the SimAM attention module into StarNet, improving the multiscale feature fusion without increasing the parametric complexity. Finally, to improve the detection of fine-grained features, especially for small disease spots and subtle lesion edges, we developed a new CFM (comprising CGLU, FasterBlock, and MANet) module as a replacement for the original RepC3 block.

Fig. 2. Schematic diagram of the SCFM-DETR structure

Design of the SimAM-StarNet backbone

We integrate the SimAM attention mechanism into the star blocks of StarNet, calling the new blocks SSt blocks. Replacing the standard star blocks with SSt blocks results in a new backbone, SimAM-StarNet, as illustrated in Fig. 3.

Fig. 3. Schematic diagram of the SimAM-StarNet structure

SimAM attention mechanism

SimAM [24] employs the perspective of “energy functions” within neuroscience to measure the importance of each neuron within 3D feature maps, thereby generating attention weights. Lower energy values indicate greater distinctiveness between that neuron and its neighbours, signifying its greater importance. For an input feature map $X \in \mathbb{R}^{C \times H \times W}$ and a given channel $c$, the feature values are $X_c = \{x_1, x_2, \cdots, x_{HW}\}$. For a neuron $x_i$ in channel $c$, the energy function is defined as:

$$E(x_i) = (x_i - \mu_c)^2 + \lambda \cdot \sigma_c^2 \tag{1}$$

where $\mu_c = \frac{1}{HW} \sum_{j=1}^{HW} x_j$ is the channel-wise mean, $\sigma_c^2 = \frac{1}{HW} \sum_{j=1}^{HW} (x_j - \mu_c)^2$ is the channel-wise variance, and $\lambda = 1 \times 10^{-4}$. The energy is normalised using the sigmoid function to obtain the attention weights:

$$\omega_i = \sigma\left(\frac{1}{E(x_i)}\right) \tag{2}$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$. The output feature is obtained by element-wise multiplication:

$$\hat{x}_i = \omega_i \cdot x_i \tag{3}$$

As depicted in Fig. 4, while conventional channel-based (a) and spatial attention mechanisms (b) are constrained by their respective dimensionality, SimAM (c) generates a three-dimensional attention weighting matrix with clear physical interpretability.

Fig. 4. Comparison of the attention mechanisms: channel, spatial, and SimAM
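The energy-based weighting defined in this subsection can be sketched in a few lines of plain Python. The sketch follows the equations exactly as they are written here (energy, sigmoid of its reciprocal, element-wise product); the reference SimAM implementation may differ in its details. The names `simam_channel` and `simam`, and the small `EPS` guard against division by zero for constant channels, are assumptions introduced only for this illustration.

```python
import math

LAMBDA = 1e-4  # regularisation constant lambda from the text
EPS = 1e-12    # hypothetical guard against division by zero (not in the text)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simam_channel(values, lam=LAMBDA):
    """Weight one flattened channel (H*W values) by its per-neuron energy."""
    n = len(values)
    mu = sum(values) / n                           # channel-wise mean
    var = sum((v - mu) ** 2 for v in values) / n   # channel-wise variance
    weighted = []
    for v in values:
        energy = (v - mu) ** 2 + lam * var         # energy of this neuron
        weight = sigmoid(1.0 / max(energy, EPS))   # attention weight
        weighted.append(weight * v)                # element-wise product
    return weighted


def simam(feature_map):
    """Apply the weighting to a C x H x W feature map given as nested lists."""
    out = []
    for channel in feature_map:
        width = len(channel[0])
        flat = [v for row in channel for v in row]
        w = simam_channel(flat)
        out.append([w[r * width:(r + 1) * width] for r in range(len(channel))])
    return out


print(simam([[[0.0, 1.0], [2.0, 3.0]]]))
```

Because the weights depend only on channel statistics, the operation adds no learnable parameters, which is consistent with the claim that SimAM does not increase the parametric complexity.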

StarNet

StarNet [25] is a lightweight CNN backbone network proposed by the Microsoft team at CVPR 2024. Its core innovation lies in achieving implicit high-dimensional feature mapping through the “Star Operation”, enabling exponential growth in nonlinear expressive power without widening the network. Drawing inspiration from polynomial kernel techniques, it accomplishes high-dimensional feature representation within a low-dimensional computational space, establishing a new paradigm for efficient network design. The specific operation is as follows.

In a single layer of a neural network, the star operation is expressed as $(W_1^{T} X + B_1) * (W_2^{T} X + B_2)$. This expression simplifies to $(W_1^{T} X) * (W_2^{T} X)$ when the weight matrix and bias are combined into a single entity, denoted as $W = \begin{bmatrix} W \\ B \end{bmatrix}$, and, similarly, $X = \begin{bmatrix} X \\ 1 \end{bmatrix}$. In scenarios involving single-output channel conversion and single-element input, we define $\omega_1, \omega_2, x \in \mathbb{R}^{(d+1) \times 1}$, where $d$ denotes the number of input channels. The operation can be extended to multiple output channels and multiple feature elements, in which case $W_1, W_2 \in \mathbb{R}^{(d+1) \times (d'+1)}$ and $X \in \mathbb{R}^{(d+1) \times n}$. StarNet rewrites the star operation as follows:

$$\begin{aligned}
\omega_1^{T} x * \omega_2^{T} x &= \left( \sum_{i=1}^{d+1} \omega_1^{i} x^{i} \right) * \left( \sum_{j=1}^{d+1} \omega_2^{j} x^{j} \right) \\
&= \sum_{i=1}^{d+1} \sum_{j=1}^{d+1} \omega_1^{i} \omega_2^{j} x^{i} x^{j} \\
&= \sum_{i=1}^{d+1} \sum_{j=i}^{d+1} \alpha_{(i,j)} \varphi_{(i,j)}(x)
\end{aligned} \tag{4}$$

where

$$\alpha_{(i,j)} = \begin{cases} \omega_1^{i} \omega_2^{j} & \text{if } i = j, \\ \omega_1^{i} \omega_2^{j} + \omega_1^{j} \omega_2^{i} & \text{if } i \neq j, \end{cases} \qquad \varphi_{(i,j)}(x) = x^{i} x^{j},$$

and $i$ and $j$ serve as channel indices.

Single-layer star operations generate approximately $\frac{(d+1)(d+2)}{2}$ linearly independent quadratic features, implicitly mapping the input to an $O(d^2)$-dimensional space. This achieves the same result as explicitly computing high-dimensional features, yet requires only $2d+2$ parameters ($\omega_1$ and $\omega_2$, each of $d+1$ dimensions). When $k$ layers of star operations are stacked, the feature space dimension increases exponentially to $O(d^{2^k})$. This exponential dimensionality growth endows StarNet with exceptionally strong non-linear expressive capabilities, far surpassing traditional convolutional layers with equivalent parameter counts.
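The identity behind the star operation can be checked numerically. The sketch below compares the product of two linear projections with its expansion into pairwise terms, counting each unordered pair $(i, j)$ once, and confirms that the number of distinct quadratic terms is $(d+1)(d+2)/2$. The function names, the choice $d = 4$, and the random test values are illustrative assumptions.

```python
import random


def star(w1, w2, x):
    """Product of two linear projections of x (bias folded into x)."""
    dot1 = sum(a * b for a, b in zip(w1, x))
    dot2 = sum(a * b for a, b in zip(w2, x))
    return dot1 * dot2


def star_expanded(w1, w2, x):
    """Sum over i <= j of alpha_(i,j) * x^i * x^j, plus the term count."""
    n = len(x)
    total = 0.0
    n_terms = 0
    for i in range(n):
        for j in range(i, n):
            if i == j:
                alpha = w1[i] * w2[j]
            else:
                alpha = w1[i] * w2[j] + w1[j] * w2[i]
            total += alpha * x[i] * x[j]
            n_terms += 1
    return total, n_terms


d = 4
rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(d)] + [1.0]  # trailing 1 carries the bias
w1 = [rng.uniform(-1, 1) for _ in range(d + 1)]
w2 = [rng.uniform(-1, 1) for _ in range(d + 1)]

lhs = star(w1, w2, x)
rhs, n_terms = star_expanded(w1, w2, x)
print(abs(lhs - rhs) < 1e-9, n_terms, (d + 1) * (d + 2) // 2)  # True 15 15
```

The left-hand side uses only the $2d+2$ weights, while the right-hand side makes the 15 implicit quadratic features explicit for $d = 4$.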

Design of the CGLU-FasterBlock-MANet (CFM) module

The RepC3 module serves as the core feature extraction component in the RT-DETR, and the quality of its output features plays a critical role in the detection performance of the subsequent prediction head. Although RepC3 employs a stacked 3 × 3 convolutional structure, its receptive field remains constrained due to a fixed dilation rate, thereby limiting its ability to detect small objects and distinguish targets in complex environments. As a result, RepC3 often struggles to capture fine-grained features with high accuracy. To address these issues, we introduce a replacement for RepC3—the CFM module, which integrates the Convolutional Gated Linear Unit (CGLU), FasterBlock, and Mixed Aggregation Network (MANet) components. To construct the CFM module, a novel Fas-CGLU Neck is first proposed. The CGLU is incorporated following the partial convolution within the FasterBlock to replace the original two 1 × 1 convolutions. After that, the original ConvNeck module in MANet is substituted with the Fas-CGLU Neck. The detailed architecture of the CFM module is illustrated in Fig. 5.

Fig. 5. Schematic diagram of the CFM structure

The core concept of MANet [26] is parallel processing combined with intelligent aggregation. It captures features at different levels and with varying receptive fields through multiple parallel feature extraction pathways [as indicated by $Y_1, Y_2, \ldots, Y_{n+4}$ in Eq. (5)], subsequently fusing these complementary features in an efficient manner, which can be mathematically expressed as:

$$\left\{ \begin{aligned}
& Y = \mathrm{Conv}_{1 \times 1}(X_{\mathrm{in}}) \\
& Y_1 = \mathrm{Conv}_{1 \times 1}(Y) \\
& Y_2 = \mathrm{DSConv}(\mathrm{Conv}_{1 \times 1}(Y)) \\
& Y_3, Y_4 = \mathrm{Split}(Y) \\
& \left. \begin{array}{l}
Y_5 = \mathrm{ConvNeck}(Y_4) + Y_4 \\
Y_6 = \mathrm{ConvNeck}(Y_5) + Y_5 \\
\cdots \\
Y_{n+4} = \mathrm{ConvNeck}(Y_{n+3}) + Y_{n+3}
\end{array} \right\} \quad n
\end{aligned} \right. \tag{5}$$

Here, the channel count of $Y$ is $2c$, whereas each of $Y_1, Y_2, \cdots, Y_{n+4}$ has a channel count of $c$. Replacing the ConvNeck in Eq. (5) with the FasCGLU in Eq. (9) yields the output of the CFM module:

$$\mathrm{CFM}_{\mathrm{out}}=\mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}_{\mathrm{C}}\left(\mathrm{Y}_1,\mathrm{Y}_2,\ldots,\mathrm{Y}_{\mathrm{n}+4}\right)\right)$$
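The channel bookkeeping implied by the equations above can be checked with a short sketch. The function name `cfm_channel_plan` and the example values `c=64`, `n=2` are illustrative assumptions, not taken from the paper; the sketch only counts channels and does not implement any convolution.

```python
def cfm_channel_plan(c: int, n: int) -> dict:
    """Count channels per branch for base width c and n stacked residual blocks."""
    y = 2 * c                    # Y = Conv1x1(X_in) has 2c channels
    y3, y4 = y // 2, y // 2      # Split(Y) halves the channels, giving c each
    branches = [c, c, y3, y4]    # Y1, Y2, Y3, Y4
    branches += [c] * n          # Y5 ... Y_{n+4}: each residual block keeps c channels
    return {
        "num_branches": len(branches),
        "concat_channels": sum(branches),  # input width of the final Conv1x1
    }


plan = cfm_channel_plan(c=64, n=2)
print(plan["num_branches"], plan["concat_channels"])  # 6 384
```

Under these assumptions, the final 1 × 1 convolution sees (n + 4)·c input channels, so the width of the concatenation grows linearly with the number of stacked blocks.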

CGLU module

CGLU achieves adaptive feature fusion through its unique gating mechanism, which can be mathematically expressed as:

$$\mathrm{F}\left(\mathrm{X}\right)=\mathrm{GELU}\left(\mathrm{DWConv}\left(\mathrm{W}_1\mathrm{X}\right)\right),\quad \mathrm{G}\left(\mathrm{X}\right)=\mathrm{W}_2\mathrm{X}$$

$$\begin{aligned}
\mathrm{CGLU}\left(\mathrm{X}\right) & =\mathrm{X}+\mathrm{Dropout}\left(\mathrm{F}\left(\mathrm{X}_1\right) \odot \mathrm{G}\left(\mathrm{X}_2\right)\right), \\
\left[\mathrm{X}_1,\mathrm{X}_2\right] & =\mathrm{Split}_{\mathrm{C}}\left(\mathrm{Conv}_{1 \times 1}\left(\mathrm{X}\right)\right)
\end{aligned}$$

where $\mathrm{X}\in \mathbb{R}^{\mathrm{N}\times \mathrm{C}\times \mathrm{H}\times \mathrm{W}}$, $\mathrm{W}_{1}\in \mathbb{R}^{\left(2\mathrm{h}\right)\times \mathrm{C}\times 1\times 1}$, $\mathrm{W}_{2}\in \mathbb{R}^{\mathrm{C}\times \mathrm{h}\times 1 \times 1}$, and $\mathrm{h} = \left\lfloor \frac{2\mathrm{C}}{3} \right\rfloor$. $\mathrm{F}\left(\mathrm{X}\right)$ represents the feature transformation path, while $\mathrm{G}\left(\mathrm{X}\right)$ denotes the gating weight matrix dynamically generated from the same input.
The key innovation lies in $\mathrm{G}\left(\mathrm{X}\right)$ being a dynamically computed tensor for each input sample $\mathrm{X}$, allowing the network to autonomously determine feature processing strategies based on the current input content. This element-wise multiplication fusion operates simultaneously across both the channel and spatial dimensions, enabling the network to apply differentiated weighting across distinct feature channels and image regions. This enhances both the learning capacity and representational power of the network. The structure of the CGLU module is depicted in Fig. 6.
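The gating idea can be illustrated with a deliberately reduced sketch in plain Python. It keeps only the hidden width h = ⌊2C/3⌋ and the element-wise product of a GELU-activated feature path with a gate, followed by the residual addition; the 1 × 1 and depthwise convolutions and the dropout are omitted, and the function names are hypothetical.

```python
import math


def gelu(v: float) -> float:
    """Exact GELU: v * Phi(v), where Phi is the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def hidden_width(channels: int) -> int:
    """h = floor(2C / 3), as defined in the text."""
    return (2 * channels) // 3


def gated_residual(x, feat, gate):
    """x + GELU(feat) * gate, element by element (convolutions and dropout omitted)."""
    return [xi + gelu(fi) * gi for xi, fi, gi in zip(x, feat, gate)]


print(hidden_width(256))  # 170
# A zero feature or a zero gate leaves the residual input unchanged at that position.
print(gated_residual([1.0, 1.0], [0.0, 2.0], [5.0, 0.0]))  # [1.0, 1.0]
```

Because the gate is computed from the input rather than stored as a fixed weight, each position can suppress or pass its feature independently, which is the behaviour the paragraph above describes.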

FasterBlock component

The FasterBlock component employs an adaptive fusion mechanism to combine the cross-layer features, while its integration of nonlinear activation functions and normalization operations enhances the model’s nonlinear representation capacity. Through residual connections, this module improves gradient flow and promotes feature reuse, simultaneously contributing to a reduction in the number of model parameters. The architecture of FasterBlock is presented in Fig. 6.

Design of the Fas-CGLU neck

By integrating the CGLU with FasterBlock, the resulting Fas-CGLU Neck substantially enhances the feature representation capability, significantly improving the model’s adaptability to complex agricultural scenes. This improvement is particularly evident in nonlinear information modelling, where the module can identify fine-grained object details and boundaries. These advancements lead to increased recognition accuracy and robustness to environmental variations. The structure of Fas-CGLU Neck is shown in Fig. 6, with the mathematical expression referenced as formula (9).

Fig. 6. Structural diagram of the Fas-CGLU Neck

$$\mathrm{FasCGLU}\left(\mathrm{X}\right)=\mathrm{X}+\mathrm{DropPath}\left(\mathrm{CGLU}\left(\mathrm{PartialConv3}\left(\mathrm{X}\right)\right)\right)$$

Here, $\mathrm{PartialConv3}\left(\mathrm{X}\right)=\mathrm{Concat}\left(\mathrm{W}_{\mathrm{p}}\mathrm{X}_{\mathrm{p}}, \mathrm{X}_{\mathrm{skip}}\right)$, where $\mathrm{X}_{\mathrm{p}}$ represents the first $\left\lfloor \frac{\mathrm{C}_{\mathrm{dim}}}{\mathrm{n}_{\mathrm{div}}} \right\rfloor$ channels, $\mathrm{X}_{\mathrm{skip}}$ denotes the remaining channels, which are passed through directly, and $\mathrm{W}_{\mathrm{p}} \in \mathbb{R}^{\left\lfloor \frac{\mathrm{C}_{\mathrm{dim}}}{\mathrm{n}_{\mathrm{div}}} \right\rfloor \times 1 \times 3 \times 3}$.
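As a small illustration of the channel split in PartialConv3, the sketch below computes how many channels are convolved and how many are passed through. The helper name and the example value `n_div=4` are assumptions chosen for illustration; the text does not state the value used in the model.

```python
def partial_conv_split(c_dim: int, n_div: int) -> tuple:
    """Return (convolved, passthrough) channel counts for a partial convolution."""
    c_conv = c_dim // n_div   # first floor(C_dim / n_div) channels receive the 3x3 conv
    c_skip = c_dim - c_conv   # remaining channels are copied unchanged
    return c_conv, c_skip


print(partial_conv_split(64, 4))  # (16, 48)
```

With these example values only a quarter of the channels are convolved, which is where the parameter and computation savings of the block come from.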

Experiments and analysis

Experimental environment and hyperparameter settings

All experiments were conducted in PyTorch 2.6.0 on a server equipped with an NVIDIA RTX 4080 SUPER GPU (16 GB of video memory), running Windows 10, Python 3.13.2, and CUDA 12.4. The hyperparameters are listed in Table 2.

Table 2. Main hyperparameter settings for the experimental environment

| Hyperparameter | Value |
| --- | --- |
| Training epochs | 300 |
| Resolution | 640 × 640 |
| Batch size | 4 |
| Weight decay | 0.0005 |
| Activation function | SiLU |
| Initial learning rate | 0.0001 |
| Optimizer | Adam |

Definition of the evaluation metrics

We employed the number of parameters (Params), the number of giga floating-point operations (GFLOPs), and the model size as metrics for evaluating model efficiency. Together they characterize efficiency along three dimensions: storage overhead, computational complexity, and deployment cost. Additionally, we adopted precision (P), recall (R), and mean average precision (mAP) as metrics for assessing model accuracy, defined as follows:

$$\mathrm{P}\left(\mathrm{Precision}\right)=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$

$$\mathrm{R}\left(\mathrm{Recall}\right)=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$

$$\mathrm{AP}_{\mathrm{i}}=\int_{0}^{1} \mathrm{P}_{\mathrm{i}}\left(\mathrm{R}_{\mathrm{i}}\right)\mathrm{d}\mathrm{R}_{\mathrm{i}}$$

$$\mathrm{mAP}=\frac{1}{\mathrm{C}}\sum_{\mathrm{i}=1}^{\mathrm{C}} \mathrm{AP}_{\mathrm{i}}$$

In these metrics, TP (true positive) is the number of correctly identified positive instances, FN (false negative) is the number of positive instances that were missed, and FP (false positive) is the number of negative instances that were incorrectly identified as positive. The AP (average precision) quantifies the model's recognition accuracy for a specific category, and the mAP is the average of the AP values across all C categories, evaluated at an IoU threshold of 0.5 (reported as mAP50 in Table 11).
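The four definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the study: the AP integral is approximated with the trapezoidal rule over sampled precision-recall points, whereas detection toolkits commonly use interpolated variants, and the counts and curve points below are made-up example values.

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); returns 0.0 when there are no predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN); returns 0.0 when there are no ground-truth objects."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions) -> float:
    """Trapezoidal area under a precision-recall curve (recalls sorted ascending)."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def mean_average_precision(ap_per_class) -> float:
    """mAP = average of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


print(precision(tp=8, fp=2), recall(tp=8, fn=2))            # 0.8 0.8
print(average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5]))  # 0.875
print(mean_average_precision([0.875, 0.625]))               # 0.75
```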

Comprehensive performance evaluation of the SCFM-DETR

In this study, a systematic performance evaluation of the SCFM-DETR and its baseline model, the RT-DETR, was conducted on the maize disease dataset. We compared the SCFM-DETR with four RT-DETR variants built on ResNet backbones: RT-DETR-R18, RT-DETR-R34, RT-DETR-R50, and RT-DETR-R101. The results are shown in Table 3.

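Table 3. Comprehensive performance comparison between the RT-DETR series and the SCFM-DETR

| Model | P% | R% | mAP% | Params (M) | GFLOPs | Weights size (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| RT-DETR-R18 | 92.9 | 89.8 | 93.6 | 19.9 | 57.0 | 307.5 |
| RT-DETR-R34 | 93.4 | 90.5 | 94.9 | 31.1 | 88.8 | 479.2 |
| RT-DETR-R50 | 93.9 | 91.3 | 95.3 | 41.9 | 129.6 | 654.6 |
| RT-DETR-R101 | 94.1 | 93.9 | 97.2 | 74.7 | 247.1 | 1169.9 |
| SCFM-DETR | 95.8 | 95.8 | 96.7 | 10.6 | 29.2 | 163.7 |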

RT-DETR-R18, which offers a balance between performance and efficiency, was selected as the baseline model against which the enhanced SCFM-DETR model was evaluated. Compared with this lightweight baseline, the SCFM-DETR model is markedly more efficient, with reductions of 47% in the number of parameters, 49% in the GFLOPs, and 47% in the model size. At the same time, its detection performance improved, with gains of 2.9 percentage points in precision, 6.0 percentage points in recall, and 3.1 percentage points in mAP. These results show that the SCFM-DETR model achieves a favourable trade-off between detection performance and computational efficiency, making it well suited to deployment in resource-constrained scenarios.
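The reported reductions follow directly from the RT-DETR-R18 and SCFM-DETR rows of Table 3. The short sketch below reproduces the arithmetic; the helper name `relative_reduction` is an illustrative choice.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage reduction of `improved` relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# (RT-DETR-R18, SCFM-DETR) pairs copied from Table 3
table3 = {
    "Params (M)": (19.9, 10.6),
    "GFLOPs": (57.0, 29.2),
    "Weights size (MB)": (307.5, 163.7),
}

for name, (base, ours) in table3.items():
    print(f"{name}: {relative_reduction(base, ours):.1f}%")
# Params (M): 46.7%
# GFLOPs: 48.8%
# Weights size (MB): 46.8%
```

Rounded to whole percentages, these values give the 47%, 49%, and 47% quoted in the text.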

The training processes of RT-DETR-R18 and the proposed SCFM-DETR model were then compared (Fig. 7). As shown in Fig. 7a, the SCFM-DETR model exhibits lower validation loss throughout training. Figure 7b shows that the SCFM-DETR model achieves higher mAP values than RT-DETR-R18, with faster convergence and a more stable ascending trajectory, suggesting that the architectural improvements help mitigate overfitting and increase both the detection accuracy and training robustness. The consistent advantage of the SCFM-DETR in terms of both loss reduction and accuracy gain supports the efficacy of the proposed modifications.

Fig. 7. Comparison of the validation loss (a) and mAP (b) curves between RT-DETR-R18 and the SCFM-DETR during training

Effectiveness verification of key modules in the SCFM-DETR

To enhance model performance, we introduced three key components: the StarNet backbone, the SimAM attention module, and the CFM feature fusion module. A set of ablation experiments was designed to dissect their individual and combined effects.

Effectiveness of the StarNet backbone

The four variants in the StarNet series (StarNet_s1 to StarNet_s4) are designed with progressively increasing complexity, following a “lightweight to high-accuracy” principle. Their varying depths, parameter counts, computational complexities, and feature extraction capabilities make them suitable for different computational constraints and task demands. In this study, the impact of each variant as the backbone network on the detection efficiency of the RT-DETR model was evaluated. The corresponding results are presented in Table 4.

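Table 4. Information on different versions of StarNet

| Variant | Depth | P% | R% | mAP% | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| StarNet_s1 | [2,2,8,3] | 92.5 | 88.7 | 93.1 | 11.9 | 31.5 |
| StarNet_s2 | [1,2,6,2] | 94.0 | 91.2 | 94.6 | 12.0 | 31.8 |
| StarNet_s3 | [2,2,8,4] | 94.2 | 90.3 | 94.4 | 14.3 | 36.4 |
| StarNet_s4 | [3,3,12,5] | 94.7 | 92.0 | 94.8 | 16.0 | 41.2 |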

StarNet_s2 provides the best balance between accuracy and efficiency. It outperforms StarNet_s1 in terms of feature representation but remains more efficient than the deeper StarNet_s3 and s4 variants. These characteristics are ideal for the resource-limited scenarios common in agricultural applications. Therefore, StarNet_s2 was chosen for the subsequent comparative experiments.

To develop a lightweight yet accurate model suitable for practical deployment, in this study, a range of backbone networks within the RT-DETR framework were evaluated. The proposed StarNet was compared against the original ResNet-18 as well as several recent architectures, including MobileNetV4, FasterNet, the Swin Transformer, RepViT, TransNeXt, and VanillaNet. A comprehensive analysis was performed to assess the influence of different backbones on the detection efficiency, and the results are detailed in Table 5.

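Table 5. Comparison of backbone network substitution based on the RT-DETR model

| Backbone (RT-DETR) | P% | R% | mAP% | Params (M) | GFLOPs | Weights size (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet18 | 92.9 | 89.8 | 93.7 | 19.9 | 57.0 | 307.5 |
| MobileNet v4 | 93.2 | 90.1 | 93.7 | 11.3 | 39.5 | 177.0 |
| FasterNet | 92.6 | 89.4 | 93.2 | 10.8 | 28.5 | 169.1 |
| Swin Transformer | 95.1 | 91.6 | 95.0 | 36.3 | 97.0 | 558.8 |
| RepViT | 93.5 | 90.3 | 93.8 | 13.3 | 36.4 | 208.6 |
| TransNeXt | 94.9 | 91.7 | 94.8 | 21.1 | 64.6 | 537.7 |
| VanillaNet | 94.7 | 91.6 | 94.5 | 21.7 | 110.2 | 424.1 |
| StarNet | 94.0 | 91.2 | 94.6 | 12.0 | 31.8 | 187.2 |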

In terms of accuracy, architectures such as the Swin Transformer and TransNeXt achieve high values due to their sophisticated designs; however, they also have substantial resource demands. Although StarNet exhibits slightly lower precision (P), recall (R), and mAP values than the Swin Transformer, it requires only about one-third of the parameters and model size. With regard to computational efficiency, lightweight models such as FasterNet struggle to achieve high accuracy, demonstrating the limitations of traditional lightweight models in the joint optimization of accuracy and efficiency. StarNet and MobileNet v4 have similar parameter counts, and the mAP of StarNet is 0.9 percentage points higher. Moreover, MobileNet v4's use of a Universal Inverted Bottleneck (UIB) and Multi-query Attention (MQA) leads to limited interpretability, a large hyperparameter search space, and considerable black-box behaviour. In contrast, StarNet's star operation is structurally simple and can be readily integrated into various CNN or transformer architectures, offering stronger generalization potential.

In summary, StarNet demonstrates competitive performance among contemporary backbones, significantly reducing the computational overhead while maintaining a favourable balance between accuracy and efficiency.

Effectiveness of the SimAM attention module

To enhance the representational ability of the StarNet backbone, we integrated the SimAM attention mechanism. To evaluate its effectiveness, we compared it against three classic attention mechanisms: Squeeze-and-Excitation (SE), Coordinate Attention (CA), and the Convolutional Block Attention Module (CBAM). The results of this comparison are presented in Table 6. By dynamically adjusting the activation strength of neurons, SimAM increases the recall by 1.4 percentage points and the mAP by 0.8 percentage points without adding parameters or computation, thereby enhancing the representational capacity of the StarNet backbone efficiently. The combined channel and spatial attention of the CBAM achieves a recall marginally higher than that of SimAM, but its additional parameters and computational load may limit its application in real-time tasks.

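Table 6. Comparison of different attention mechanisms introduced into the StarNet backbone network

| Attention mechanism | P% | R% | mAP% | Params (M) | GFLOPs | Weights size (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| – | 94.0 | 91.2 | 94.7 | 12.0 | 31.8 | 187.2 |
| SE | 94.1 | 91.4 | 94.8 | 12.5 | 31.9 | 194.5 |
| CA | 94.2 | 92.4 | 95.1 | 12.5 | 31.9 | 195.4 |
| CBAM | 94.4 | 92.7 | 95.3 | 13.4 | 33.4 | 205.6 |
| SimAM | 94.5 | 92.6 | 95.5 | 12.0 | 31.8 | 187.2 |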

Effectiveness of the CFM feature fusion module

To evaluate the effectiveness of the replacement of the RepC3 module with the CFM module, we conducted an ablation experiment comparing their performance, and the results are summarized in Table 7.

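Table 7. Ablation experiment based on the CFM

| Module | P% | R% | mAP% | Params (M) | GFLOPs | Weights size (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| MANet + FasterBlock + CGLU (CFM) | 95.0 | 91.8 | 95.1 | 18.4 | 53.9 | 282.2 |
| MANet + FasterBlock | 93.8 | 91.6 | 94.6 | 19.8 | 52.4 | 303.8 |
| MANet | 93.4 | 91.2 | 94.1 | 22.0 | 59.3 | 336.6 |
| RepC3 | 92.9 | 89.8 | 93.6 | 19.9 | 57.0 | 307.5 |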

The results in Table 7 verify that adding FasterBlock reduces the number of parameters and the computational complexity while preserving the feature extraction capability of the model. Although MANet can enhance the feature representations, FasterBlock is needed to suppress parameter expansion. Building on the advantages of both components, the CGLU module establishes a closed loop of "feature selection–compression–fusion" through dynamic gating, ultimately yielding increases in the accuracy (mAP: +1.55%), efficiency (number of parameters: −7.3%), and generalizability (R: +1.99%). These results validate the superiority of the CFM in terms of balancing the detection performance with the resource cost.

Ablation experiment with the SCFM-DETR model

The SCFM-DETR model incorporates three main improvements: the StarNet feature extraction backbone, the SimAM attention mechanism, and the CFM module. To systematically evaluate their individual and combined contributions, six ablation experiments were conducted by using RT-DETR-R18 as the baseline under consistent training settings. The overall results are summarized in Table 8, while the category-wise precision, recall, and mAP values for each experimental configuration are provided in Table 9 and visualized in Fig. 8.

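Table 8. SCFM-DETR ablation experiment based on RT-DETR-R18

| Group | StarNet | SimAM | CFM | P% | R% | mAP% | Params (M) | GFLOPs | Weights size (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | × | × | × | 92.9 | 89.8 | 93.7 | 19.9 | 57.0 | 307.5 |
| 2 | × | × | √ | 95.0 | 91.8 | 95.2 | 18.4 | 53.9 | 282.2 |
| 3 | √ | × | × | 94.0 | 91.2 | 94.7 | 12.0 | 31.8 | 187.2 |
| 4 | √ | √ | × | 94.5 | 92.7 | 95.5 | 12.0 | 31.8 | 187.2 |
| 5 | √ | × | √ | 95.1 | 93.1 | 96.0 | 10.6 | 29.2 | 163.7 |
| 6 | √ | √ | √ | 95.8 | 94.8 | 96.7 | 10.6 | 29.2 | 163.7 |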

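Table 9. Comparison of the mAP across categories for the ablation experiments in each group

| Group | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 94.5 | 90.1 | 84.1 | 94.9 | 96.0 | 95.6 | 94.8 | 99.6 | 93.7 |
| 2 | 96.5 | 95.5 | 90.3 | 96.3 | 92.5 | 98.1 | 95.9 | 96.5 | 95.2 |
| 3 | 92.3 | 92.0 | 93.8 | 94.6 | 95.7 | 97.5 | 96.5 | 95.2 | 94.7 |
| 4 | 94.7 | 95.2 | 95.0 | 95.4 | 95.7 | 95.3 | 96.4 | 96.3 | 95.5 |
| 5 | 95.2 | 95.7 | 95.5 | 95.9 | 96.2 | 95.8 | 96.9 | 96.8 | 96.0 |
| 6 | 96.0 | 95.5 | 94.9 | 97.3 | 99.0 | 95.8 | 96.6 | 98.5 | 96.7 |

The All column reports the average value across categories.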

In Tables 9 and 12, the numbers 0–7 represent the eight categories in the maize disease dataset, corresponding sequentially to leaf blight, grey leaf spot, rust, downy mildew, ear rot, small spot, brown spot, and healthy leaves.
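As a quick consistency check, the overall mAP of a group is the arithmetic mean of its eight per-category values. The sketch below recomputes it for Groups 1 and 6 using the numbers in Table 9; the variable and function names are illustrative.

```python
# Per-category mAP values (categories 0-7) copied from Table 9
ap_by_group = {
    1: [94.5, 90.1, 84.1, 94.9, 96.0, 95.6, 94.8, 99.6],
    6: [96.0, 95.5, 94.9, 97.3, 99.0, 95.8, 96.6, 98.5],
}


def mean_ap(per_class_ap):
    """Average the per-category AP values and round to one decimal place."""
    return round(sum(per_class_ap) / len(per_class_ap), 1)


for group, values in ap_by_group.items():
    print(group, mean_ap(values))
# 1 93.7
# 6 96.7
```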

Fig. 8. Comparison of the precision (a), recall (b), and mAP (c) across categories for the ablation experiments in each group

As indicated by Groups 1 and 2 in Table 8, replacing the RepC3 module with the proposed CFM module improved all of the performance metrics. These results suggest that the CFM module reduces parameter redundancy by optimizing the feature fusion pathway and reuse mechanism, thereby mitigating missed detections of maize diseases against complex backgrounds.

Given the relatively large scale of the original RT-DETR model, we substituted its backbone with the lightweight StarNet architecture. This replacement resulted in the number of parameters decreasing by 39.7% and the GFLOPs decreasing by 44.2%, while the detection accuracy increased slightly. Furthermore, integrating SimAM into the StarNet backbone—without increasing the number of parameters or computational cost—further increased the mAP and R by 0.8% and 1.5%, respectively, thereby enhancing the generalizability of the model.

The models from the aforementioned ablation studies were evaluated on the maize disease dataset, with the detection results visualized in Fig. 9 and the corresponding quantitative statistics presented in Table 10. In Fig. 9, subfigures (1) to (6) represent the results from the 1st to 6th ablation experimental groups, respectively. The red boxes indicate missed detections, the green bounding boxes denote false detections, and the remaining boxes indicate correct detections. The detection images were obtained from an independently collected validation dataset. Image 1 presents a single-leaf disease detection scenario under normal conditions, Image 2 depicts small-target disease detection under complex lighting conditions, Image 3 shows mixed disease detection involving similar categories, Image 4 demonstrates disease detection under leaf occlusion, and Image 5 depicts multi-leaf disease detection in complex backgrounds. These experiments further verify the model’s detection performance in challenging scenarios that are prone to false positives and missed detections.

Fig. 9. SCFM-DETR ablation experiment detection results

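Table 10. Statistics of the SCFM-DETR ablation experiment detection results

| Group | Correct detections | Correct detection rate (%) | Missed detections | Missed detection rate (%) | False detections | False detection rate (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 5 | 55.6 | 4 | 44.4 | 2 | 22.2 |
| 2 | 4 | 44.4 | 4 | 44.4 | 2 | 22.2 |
| 3 | 5 | 55.6 | 3 | 33.3 | 2 | 22.2 |
| 4 | 7 | 77.8 | 1 | 11.1 | 1 | 11.1 |
| 5 | 7 | 77.8 | 1 | 11.1 | 2 | 22.2 |
| 6 | 8 | 88.9 | 1 | 11.1 | 0 | 0 |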

According to the results presented in Fig. 9 and Table 10, the adoption of StarNet considerably reduced the number of parameters while preserving the detection performance. With the subsequent integration of SimAM or the CFM, the correct detection rate increased from 55.6% to 77.8%, and the missed detection rate decreased from 44.4% to 11.1%. Through the synergistic effect of these modules, the proposed SCFM-DETR model achieved a correct detection rate of 88.9%, an improvement of 33.3 percentage points over the baseline (55.6%), along with a notable reduction in missed detections and the elimination of false positives on these test images.
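The rates in Table 10 can be reproduced from the raw counts. The sketch below assumes a total of 9 annotated targets across the five test images; this total is not stated in the text and is inferred from the reported pairs (for example, 5 correct detections corresponding to 55.6%).

```python
TOTAL_TARGETS = 9  # assumption: inferred from Table 10, not stated in the text


def rate(count: int, total: int = TOTAL_TARGETS) -> float:
    """Detection rate in percent, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


baseline_correct = rate(5)  # Group 1 (RT-DETR-R18)
full_correct = rate(8)      # Group 6 (SCFM-DETR)
gain = round(full_correct - baseline_correct, 1)

print(baseline_correct, full_correct, gain)  # 55.6 88.9 33.3
print(rate(4), rate(1))                      # 44.4 11.1
```

With so few targets, a single additional detection shifts the rate by about 11 percentage points, so these figures are best read as qualitative evidence alongside the mAP results.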

Notably, none of the six experimental models successfully identified the blurred disease targets in Image 4, where lesions were partially occluded by leaves. This result is likely due to the complex background in which the lesions were located, making them difficult to distinguish. Overall, the SCFM-DETR model achieved the best comprehensive detection performance, confirming the efficacy of the introduced modules in a variety of challenging field scenarios.

Comparison experiments between the SCFM-DETR model and mainstream models

To objectively evaluate the performance of the SCFM-DETR model, comparative experiments were conducted against several mainstream crop disease detection algorithms. As summarized in Table 11, the SCFM-DETR achieved precision and recall values of 95.8%, reflecting its strong ability to minimize both missed and false detections. While YOLO11n showed advantages in terms of the parameter count and computational cost over the SCFM-DETR model, its recall (R) and mean average precision (mAP) were the lowest among all of the models, likely due to the trade-off in accuracy caused by the lightweight design.

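Table 11. Comparison experiment between the SCFM-DETR and mainstream models

| Model | P% | R% | mAP50% | Params (M) | GFLOPs | Weights size (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv5m | 90.6 | 87.8 | 91.8 | 22.1 | 52.5 | 42.6 |
| YOLOv8s | 90.1 | 86.2 | 89.0 | 9.9 | 24.6 | 20.7 |
| YOLO11n | 86.6 | 74.1 | 82.3 | 2.6 | 6.3 | 5.3 |
| Faster R-CNN | 89.7 | 88.1 | 89.5 | 80.3 | 196.0 | 210.2 |
| Deformable DETR | 90.3 | 87.9 | 90.2 | 39.8 | 41.0 | 86.0 |
| SCFM-DETR | 95.8 | 95.8 | 96.7 | 10.6 | 29.2 | 163.7 |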

As detailed in Table 12 and Fig. 10, the SCFM-DETR model achieved an mAP of 96.7%, exceeding that of the second-best model, YOLOv5m, by 4.9 percentage points. These results demonstrate that, compared with existing approaches, the SCFM-DETR not only achieves superior detection accuracy but also maintains a reasonable computational cost.

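Table 12. Comparison of the mAP for various categories between the SCFM-DETR and mainstream models

| Model/class | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv5m | 90.8 | 88.1 | 89.2 | 93.3 | 94.2 | 91.6 | 88.7 | 98.3 | 91.8 |
| YOLOv8s | 89.5 | 84.9 | 85.4 | 87.5 | 91.9 | 84.2 | 90.6 | 98.1 | 89.0 |
| YOLO11n | 82.8 | 78.5 | 79.8 | 83.0 | 93.2 | 70.8 | 75.9 | 94.4 | 82.3 |
| Faster R-CNN | 87.5 | 85.7 | 88.1 | 91.2 | 93.6 | 89.1 | 88.8 | 92.0 | 89.5 |
| Deformable DETR | 81.2 | 96.1 | 85.4 | 95.0 | 93.8 | 87.2 | 86.6 | 96.3 | 90.2 |
| SCFM-DETR | 96.0 | 95.5 | 94.9 | 97.3 | 99.0 | 95.8 | 96.6 | 98.5 | 96.7 |

The All column reports the average value across categories.

Fig. 10. Comparison of the SCFM-DETR with mainstream models in terms of the precision (a), recall (b), and mAP (c)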

Figure 11 presents a visual comparison of the SCFM-DETR and the other models on maize disease detection, giving a more intuitive picture of the performance differences. The statistics of the detection outcomes are summarized in Table 13.

Fig. 11. Detection results of the SCFM-DETR and other models

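Table 13. Statistical results of the SCFM-DETR and other models

| Model | Correct detections | Correct detection rate (%) | Missed detections | Missed detection rate (%) | False detections | False detection rate (%) |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv5m | 5 | 55.6 | 4 | 44.4 | 0 | 0 |
| YOLOv8s | 5 | 55.6 | 4 | 44.4 | 1 | 11.1 |
| YOLO11n | 3 | 33.3 | 5 | 55.6 | 1 | 11.1 |
| Faster R-CNN | 5 | 55.6 | 4 | 44.4 | 0 | 0 |
| Deformable DETR | 6 | 66.7 | 2 | 22.2 | 1 | 11.1 |
| SCFM-DETR | 8 | 88.9 | 1 | 11.1 | 0 | 0 |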

The results reveal a noticeable decline in detection performance across the YOLO models from YOLOv5m to YOLO11n, with the correct detection rate decreasing from 55.6% to 33.3%. Compared with both YOLOv8s, which has a parameter scale similar to that of the SCFM-DETR model, and the higher-performing YOLOv5m, the proposed SCFM-DETR model achieves the lowest missed detection rate and the highest correct detection rate among all of the models.

Grad-CAM interpretability analysis

While the SCFM-DETR model demonstrated promising performance during training, a deeper understanding of its internal decision-making process is necessary. To enhance interpretability, we applied the Gradient-weighted Class Activation Mapping (Grad-CAM) algorithm [27], which produces heatmaps that visualize the contribution of different image regions to the model's predictions. Areas with higher activation intensities indicate the model's primary Regions of Interest (ROIs) [28]. Figure 12 shows the heatmaps generated during maize disease identification by the RT-DETR and SCFM-DETR models and by YOLOv5m, which performed relatively well in the comparative experiments. These visualizations facilitate the comparison of ROIs across the disease images, thereby providing an intuitive view of what each model attends to.

Fig. 12. Recognition results of the SCFM-DETR and other models

Compared with the RT-DETR-R18 model, the SCFM-DETR model detected more comprehensive disease regions and provided more precise coverage for different types of maize diseases. In contrast to YOLOv5m, the SCFM-DETR model is less affected by irrelevant backgrounds during detection and demonstrates a stronger global discriminative ability in terms of distinguishing disease targets from non-disease background regions.

SCFM-DETR generalization experiments

To validate the generalizability and robustness of the model, we annotated an additional public maize disease dataset sourced from Kaggle (https://www.kaggle.com/datasets/ndisan/corn-leaf-disease). The dataset comprises four categories—leaf blight, common rust, leaf spot, and healthy leaves—with 1,000 images per category, resulting in a total of 4,000 images. Following the same protocol, the dataset was partitioned into training (3,200 images), validation (400 images), and test (400 images) sets at an 8:1:1 ratio.
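The 8:1:1 partition can be sketched as follows. The helper `split_counts` is a hypothetical illustration of the arithmetic only; it does not reproduce the authors' actual splitting procedure, such as shuffling or per-class stratification, which the text does not describe.

```python
def split_counts(total: int, ratios=(8, 1, 1)) -> tuple:
    """Split `total` items into train/val/test counts following integer ratios."""
    unit = total // sum(ratios)
    counts = [r * unit for r in ratios]
    counts[0] += total - sum(counts)  # any remainder goes to the training set
    return tuple(counts)


print(split_counts(4000))  # (3200, 400, 400)
```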

As shown in Tables 14 and 15, the SCFM-DETR model achieved an overall mAP of 95.3%, surpassing that of the baseline model by 3.1 percentage points. For leaf blight, the category with the most complex background conditions in the dataset, the SCFM-DETR model achieved an mAP of 90.1%, outperforming both the baseline RT-DETR-R18 (82.9%) and YOLOv5m (83.0%). These results underscore the generalization capacity and robustness of the proposed model.

Table 14. Detection metrics of the generalization experiments

| Model | P% | R% | mAP% |
| --- | --- | --- | --- |
| RT-DETR-R18 | 91.5 | 88.6 | 92.2 |
| YOLOv5m | 90.4 | 91.0 | 92.3 |
| SCFM-DETR | 94.8 | 92.7 | 95.3 |

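Table 15. Class-wise detection metrics in the generalization experiment

| Model | Blight | Leaf Rust | Leaf Spot | Healthy | All |
| --- | --- | --- | --- | --- | --- |
| RT-DETR-R18 | 82.9 | 92.5 | 93.8 | 99.5 | 92.2 |
| YOLOv5m | 83.0 | 92.7 | 94.0 | 99.5 | 92.3 |
| SCFM-DETR | 90.1 | 95.4 | 96.5 | 99.5 | 95.3 |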

Discussion

The advantages of the SCFM-DETR model over comparable networks can be summarized as follows. As indicated in Tables 5 and 6, the SimAM-StarNet backbone contributes significantly to model lightweighting while still improving the accuracy. This achievement can be attributed to StarNet's "star operation" (elementwise multiplication), which enables efficient feature representation. This operation maps input features into a high-dimensional nonlinear space with a compact network structure and low computational cost, thus yielding more expressive and informative representations without increasing the computational complexity. In addition, the integration of SimAM reduces interference from cluttered backgrounds via pixelwise dynamic feature recalibration, which helps lower the false detection rate. The data in Table 7 and Fig. 9 demonstrate that the proposed CFM module, when replacing the original RepC3 module, addresses the difficulties experienced by RepC3 in capturing fine-grained features, thereby mitigating the problem of missed detections. Moreover, the heatmap results in Fig. 12 show that, compared with both the baseline model and the relatively well-performing YOLOv5m, the SCFM-DETR is the least affected by complex backgrounds. It performs better in natural, cluttered farmland environments, further supporting its robustness and real-world applicability.
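The claim that the star operation maps features into a high-dimensional nonlinear space can be made concrete with a short expansion. This is a sketch for a single output channel, with the bias folded into the weight vectors so that $\mathrm{x},\mathrm{w}_1,\mathrm{w}_2\in \mathbb{R}^{d+1}$; these symbols are introduced here for illustration and do not appear elsewhere in the text.

$$\left(\mathrm{w}_1^{\mathrm{T}}\mathrm{x}\right)\odot\left(\mathrm{w}_2^{\mathrm{T}}\mathrm{x}\right)=\sum_{i=1}^{d+1}\sum_{j=1}^{d+1} w_{1,i}\,w_{2,j}\,x_i\,x_j$$

The double sum contains $\frac{(d+1)(d+2)}{2}$ distinct monomials $x_i x_j$ with $i\le j$, so a single elementwise product already spans an implicit feature space that grows quadratically with the input width, without widening the network.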

As summarized in Tables 11, 12 and 13, compared with existing mainstream models, the SCFM-DETR model achieves a superior overall detection performance on the maize disease dataset, with higher precision (P), recall (R), and mean average precision (mAP) across most disease categories. Nevertheless, the mAP values for grey leaf spot, rust, and small leaf spot remain relatively low, which can be attributed to the high visual similarity among these diseases, leading to increased interclass confusion. Specifically, when rust symptoms are present in a scattered distribution, their grey–brown spot-like morphology closely resembles that of small leaf spot, complicating accurate differentiation. Despite these challenges, the SCFM-DETR model achieved mAP scores of 94.9% and 95.8% across the aforementioned categories. The confusion matrices comparing the proposed SCFM-DETR approach with the baseline model are provided in the Appendix, further demonstrating the effectiveness of SCFM-DETR in mitigating confusion among visually similar disease categories. Moreover, in the generalization experiments conducted with an independent public maize disease dataset (Tables 14 and 15), the SCFM-DETR model maintained robust performance, underscoring its strong generalizability.

Nevertheless, the proposed SCFM-DETR model still presents certain limitations:

Although the dataset covers a relatively wide range of maize disease categories, some classes occur less frequently, leading to a certain degree of class imbalance. This imbalance may limit the model’s generalization capability when applied to more diverse or unseen disease distributions.
In complex field environments, the detection performance of SCFM-DETR decreases for low-resolution or severely blurred disease targets, which indicates that the model’s robustness under challenging real-world conditions still requires further improvement.
While SCFM-DETR is designed with lightweight considerations, its current architecture may still impose computational and memory overheads that hinder deployment on extremely resource-constrained edge devices.

In future work, we plan to construct a more comprehensive and balanced maize disease dataset with broader category coverage and higher image quality to further enhance recognition performance and robustness. In addition, we will continue to investigate advances in lightweight detection architectures and model compression techniques, aiming to develop more efficient and accurate solutions tailored for real-world agricultural disease identification scenarios.

Conclusion

In this study, we present the SCFM-DETR model, a detection model designed for accurate and efficient recognition of visually similar maize diseases under resource-limited farmland conditions. A maize disease dataset containing 9,327 images across eight categories, covering the most common disease types, was constructed to support the experiments. Through systematic module-wise comparisons and ablation experiments, the individual and combined contributions of the StarNet backbone, SimAM attention mechanism, and CFM feature fusion module were rigorously validated. These components collectively enhance model performance, with increases of 2.9% in precision, 6.0% in recall, and 3.1% in mAP over the baseline RT-DETR-R18 model while reducing the number of parameters, computational cost (GFLOPs), and model size by 47%, 49%, and 47%, respectively. Furthermore, compared with mainstream crop disease detection models, the SCFM-DETR model achieves an optimal balance between efficiency and accuracy, reaching a category-averaged mAP of 96.7% with only 10.6 M parameters. These outcomes confirm the efficacy of the proposed architectural improvements for maize disease recognition.
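The relative reductions quoted above can be related to absolute sizes with a short back-of-the-envelope sketch. It uses only the 10.6 M parameter count and the 47% reduction stated in this section; the baseline size it prints is an implied estimate, not a figure reported here, and the helper names are illustrative.

```python
def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100.0


def implied_baseline(after, reduction_pct):
    """Baseline value implied by a reduced value and its stated percentage reduction."""
    return after / (1.0 - reduction_pct / 100.0)


scfm_params_m = 10.6        # SCFM-DETR parameters in millions, as stated in the text
param_reduction_pct = 47.0  # stated reduction relative to RT-DETR-R18

baseline_params_m = implied_baseline(scfm_params_m, param_reduction_pct)
print(f"Implied baseline size: about {baseline_params_m:.1f} M parameters")

# Round trip: recover the stated reduction from the implied baseline.
print(f"Recovered reduction: {relative_reduction(baseline_params_m, scfm_params_m):.1f}%")
```

Because the stated percentages are rounded, the implied baseline is only approximate.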

In summary, the proposed SCFM-DETR model enables accurate and efficient maize disease detection, achieving an optimal balance between computational efficiency and recognition performance. It effectively addresses key challenges in farmland deployment, such as interference from complex backgrounds and the constraints of edge-device computing resources, thereby offering both theoretical and practical advances in crop disease monitoring. In future work, we plan to further optimize the model and extend its application to a broader range of crop disease monitoring tasks, contributing to the advancement of object detection technologies in the agricultural domain.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1.
