PDGV-DETR: Object Detection for Secure On-Site Weapon and Personnel Location Based on Dynamic Convolution and Cross-Scale Semantic Fusion

Nianfeng Li; Peizeng Xin; Jia Tian; Xinlu Bai; Hongjie Ding; Zhiguo Xiao; Qian Liu

PMC · DOI:10.3390/s26051542 · February 28, 2026

Open Access

TL;DR

This paper introduces PDGV-DETR, a new object detection framework optimized for detecting weapons and personnel in security surveillance images with high accuracy and robustness.

Contribution

The novel contribution is PDGV-DETR, which uses dynamic convolution and cross-scale fusion to improve detection accuracy and robustness in complex security scenarios.

Findings

01. PDGV-DETR achieved an mAP50 of 85.9% on a conflict scene dataset, outperforming RT-DETR with a p-value less than 0.01.

02. On the OD-WeaponDetection dataset, PDGV-DETR reached 93.0% mAP for gun and knife detection, a 2.2% improvement over RT-DETR.

03. The model improved detection accuracy by 15.1% compared to Deformable DETR for personnel object localization.

Abstract

In public safety scenarios, precisely detecting and locating prohibited weapons such as firearms and knives, along with the personnel involved, is a core prerequisite for violence risk warning and emergency response. However, security surveillance footage commonly suffers from object occlusion, small weapons that are difficult to capture, and complex background interference. As a result, existing general object detection models adapt poorly to detecting and locating security-related objects, with low detection accuracy and insufficient robustness in complex scenes. This paper therefore proposes PDGV-DETR, a threat object detection framework for security scenarios based on adaptive dynamic convolution and cross-scale semantic fusion, specifically optimized for the detection and positioning tasks of…

Funding

Project of Jilin Provincial Scientific and Technological Department

Keywords

violence detection; interactive convolution; semantic weaving network; bidirectional mixed feature; security scene

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Human Pose and Action Recognition

Full text

1. Introduction

The rapid development of smart cities is pushing public security governance toward more refined, intelligent, and real-time approaches. However, incidents such as school violence [1], physical conflicts, and the illegal carrying of prohibited weapons still occur frequently. Precisely identifying risks and potential hazards in public places such as subways, back alleys, and schools has become a core requirement for building intelligent security systems.

Current research on security risk identification falls into two main technical directions. The two have clear technical boundaries and different capabilities, so their performance is not directly comparable. The first is violent behavior recognition (violence detection in the narrow sense), which takes video sequences as input. Representative models include I3D, SlowFast, and VideoMAE. These models classify behavior as violent or non-violent by modeling temporal action features, distinguishing aggressive behavior from normal activity based on temporal dynamics. Because the task is behavior classification, such models generally cannot precisely localize risk objects at the pixel level. The second is security-related object detection. This direction takes single-frame static images as input and uses object detection to find prohibited weapons such as guns and knives, as well as risky individuals. It is the foundation for precise risk early warning and is the research scope of this paper.

Currently, violence detection can only determine whether images or videos contain violent content. It cannot locate objects, so it fails to meet practical needs such as early warning of terrorist attacks [2], protection of public spaces, and object-level handling of online violent content. For example, preventing school violence [3] requires precisely locating the conflict area to provide a basis for intervention, and violent content in online media needs object-level processing rather than wholesale deletion. These scenarios urgently require integrated “detection + positioning” technology.

Traditional security systems rely on manual monitoring to identify risks, which is not only inefficient and inaccurate [4], but also prone to missed detections due to personnel fatigue. With the widespread adoption of intelligent video surveillance [5,6] and the exponential growth of video data, the integration of computer vision [7], deep learning [8], and behavior recognition [9] in object detection technologies has become a core research hotspot in the field of intelligent risk identification in security. Deep learning [10] methods represented by convolutional neural networks (CNN) [11,12] demonstrate outstanding pattern recognition capabilities. A review of violence detection [13] indicates that such technologies can effectively extract spatiotemporal features from massive data and accurately distinguish normal from abnormal behaviors. Moreover, the combination of CNN and long short-term memory (LSTM) networks [14] further enhances the modeling ability of temporal features, providing new technical ideas for security scenarios. However, most of these methods focus on behavior-level classification tasks and have not yet solved the problem of precise positioning of risk objects.

Despite many breakthroughs in object detection, weapon and person detection in security scenarios still faces several core challenges. First, object types and scales vary greatly within a scene. Detecting small weapons such as pistols and knives together with large objects such as people places high demands on multi-scale feature processing, and models are prone to missing small objects and mislocalizing large ones. Second, illumination, viewing angle, and image blur differ significantly across surveillance scenes, which weakens cross-scene generalization. Third, existing models extract weak features from occluded or incomplete weapons and people, so detection accuracy fluctuates widely and falls short of the high reliability that security scenes require.

Mainstream general object detection algorithms differ in how well they adapt to detecting and locating security-related objects. The YOLO series has been widely used in industrial detection because of the efficiency of its single-stage regression architecture [15]. However, it handles multi-scale object features in security scenes poorly, and its robustness on occluded and small weapons does not meet the deployment requirements of real security systems. In contrast, the Transformer-based RT-DETR model [16] captures the details of complex objects and contextual semantic correlations well, with stronger feature extraction and cross-layer modeling capabilities. Its end-to-end real-time architecture also meets the low-latency requirement of security systems. However, it still has inherent limitations in multi-scale feature fusion, feature calibration, and fine feature extraction. When transferred directly to security-related object detection, its detection accuracy and stability leave considerable room for improvement.

To address the four core challenges of security-related object detection (multi-scale objects that are difficult to capture, incomplete objects that are easily missed, objects that are hard to distinguish from the background, and poor cross-scene generalization), this paper proposes PDGV-DETR, an object detection model for locating weapons and personnel in security scenes. The core contributions are as follows:

(1) A bidirectional hybrid feature pyramid network with channel attention (DWH-FPN) is reconstructed. It realizes bidirectional interaction between high-level semantic features and low-level detail features through transposed convolution, and uses channel attention to generate dynamic weights. This reduces redundant computation and strengthens the binding of details and semantics in prohibited weapon detection, foreground–background separation in personnel localization, and the integration of global and local information in scene understanding, laying a foundation for multi-scale security-related object detection.
(2) The Dynamic Hierarchical Channel Interaction Convolution Module (BasicPCock) is adapted to replace the traditional modules at the end of the backbone network. Through the collaboration of fine channel convolution in the main path and adaptive channel mapping in the shortcut path, it reduces the computational load while improving robustness to incomplete objects, avoiding missed detections of occluded objects and partially visible limbs, and improving the completeness of tool and behavior recognition.
(3) The Global Semantic Weaving and Elastic Feature Alignment Network (GSWFN) is introduced and repurposed. Through a cross-scale semantic correlation mechanism, it alleviates the blurring between object and background features, enhances scene understanding, and reduces misjudgments and false positives in complex scenarios, making it suitable for a wide range of security monitoring scenes.
(4) A systematic verification across multiple datasets and models is carried out. On four security object detection datasets covering multi-scene personnel and prohibited weapons, and under the same experimental configuration, PDGV-DETR is compared with 13 mainstream general object detection models on security-related detection and localization tasks. The experiments reveal that existing models show large accuracy fluctuations across datasets, and verify the advantages of PDGV-DETR in positioning accuracy, missed detection rate, and generalization, supporting its engineering deployment.

2. Related Work

2.1. Research Status of Object Detection Technology in Security Scenarios

The research on risk recognition technology of public safety scenarios is mainly divided into two technical branches: violence behavior recognition (violence detection) and security-related object detection in security scenes. There are clear boundaries between the two in technical goals, input forms and output capabilities, and there is no horizontal performance comparability, which together constitute the risk early warning technology system of intelligent security systems.

2.1.1. Evolution of Violence Detection Technology

The development of violence detection technology has gone through two stages: “traditional machine learning” and “deep learning”. In the early stage, it relied on manual feature extraction and adaptive classifiers. After the rise of deep learning, methods based on convolutional neural networks (CNN), long short-term memory networks (LSTM), transformers, and their hybrid architectures (such as CNN + LSTM) became mainstream. Among them, CNN and transformer models were the most widely used, and the research could be classified into three directions:

Spatio-temporal convolution-based models: Santos et al. [17] proposed a 3D-CNN architecture that directly models the spatio-temporal information of video by stacking spatio-temporal convolutional layers. It achieved violence detection accuracies of 81.75% (XS version) and 84.75% (S version) on the RWF-2000 dataset, but its high computational complexity limits practical application. Magdy et al. [18] compared 3D-CNN and 4D-CNN architectures and found that the latter captures long-term temporal dependencies and reaches 94.67% accuracy on RWF-2000, but is significantly harder to optimize.

Recurrent network models: Liang et al. [19] combined YOLOv3, DeepSort, and SlowFast networks to achieve joint modeling of object tracking and temporal features, but the error accumulation caused by the recurrent structure led to a decrease in long-term tracking accuracy; Qu et al. [20] used a traditional CNN for violence classification, performing well in static images, but lacking object localization capabilities, leaving it unable to meet the location information requirements of security scenarios.

Transformer models: Akti et al. [21] used a vision transformer (ViT) for violent behavior classification, capturing global features through self-attention. Accuracy on the self-built SMFI dataset reached 95.5%, but the quadratic complexity of self-attention made inference relatively slow. Ehsan et al. [22] used generative adversarial networks (GAN) to reconstruct normal movement patterns, achieving 96% accuracy on the Hockey dataset, but the optical flow extraction stage was time-consuming. Kumar et al. [23] designed a lightweight transformer (Tubelet embedding + ViViT architecture) that achieved 98% accuracy in occlusion scenarios, but it required 30 input frames and still handled multi-scale objects and cross-scene interference poorly.

2.1.2. The Core Requirements and Technical Boundaries of Object Detection in Security Scenarios

The core function of existing violence detection methods is behavior-level classification, which can only determine whether the input content contains violence-related information, generally lacks the ability to accurately locate the risk object at the pixel level, and cannot provide core decision data such as object position and category for the security system. If the object location is completed manually after classification, the labor cost will be high, and it is urgent to introduce object detection technology to break through this bottleneck.

The core goal of security-related object detection is to recognize the category and localize the pixel-level bounding box of prohibited weapons (such as guns and knives) and personnel in a single static surveillance image. This is the core prerequisite technology for intelligent security systems to achieve accurate risk early warning and rapid emergency response, and it is also the core evaluation dimension of all comparative experiments in this paper.
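
Bounding box localization of this kind is usually scored by intersection over union (IoU), and an mAP50-style metric counts a prediction as correct when its class matches and its IoU with the ground-truth box is at least 0.5. The following is a minimal sketch of that criterion, not the evaluation code used in the paper; the box coordinates and the helper names `iou` and `is_true_positive` are hypothetical.

```python
from typing import Tuple

# A box is (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Box, truth: Box, same_class: bool,
                     threshold: float = 0.5) -> bool:
    """Correct at the IoU >= 0.5 operating point only if the class also matches."""
    return same_class and iou(pred, truth) >= threshold


# Hypothetical example: a ground-truth knife box and a slightly shifted prediction.
ground_truth = (100.0, 100.0, 200.0, 200.0)
prediction = (110.0, 110.0, 210.0, 210.0)
print(round(iou(prediction, ground_truth), 3))              # 0.681
print(is_true_positive(prediction, ground_truth, True))     # True
```

A 10-pixel shift on a 100-pixel box still passes the 0.5 threshold, whereas the same shift on a much smaller weapon box would lower IoU sharply, which is one reason small objects are harder to localize to the same standard.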

The technology is based on the general object detection technology, and is specially optimized for the object characteristics and environmental interference of the security scene. The current mainstream technology path is divided into three categories: two-stage detection, single-stage detection, and transformer end-to-end detection. Object-detection-related technologies have achieved large-scale application in industrial quality inspection, remote sensing images, medical images and other fields, but there are still obvious shortcomings in the specific adaptation of security scenes, which is also the core research direction of this paper.

To be clear, a single static image captures only the instantaneous visual characteristics of an action and cannot restore the complete context of a behavior from temporal dynamics. This is the inherent limitation of classifying violent behavior from still images. The scope of this study is therefore strictly limited to object-level detection and localization, rather than behavior-level classification.

For visually similar scenes, such as dancing versus fighting, high-fiving versus slapping, competitive sports versus physical conflict, and friendly hugs versus physical confrontation, a single still frame cannot support fully accurate classification, and false positives are unavoidable. Using three visual cues observable in a single frame (spatial location relationships, body posture, and weapon pointing), the model in this paper provides a preliminary violent/non-violent attribute together with object localization, serving as a preliminary reference for violence risk warning. The final behavioral judgment still needs to be verified with temporal information from continuous video, which is why this study plans to add temporal modeling in future work.

2.2. Universal Object Detection Technology Has Insufficient Adaptability for Security-Related Scenarios Involving Weapons and Personnel Detection

At present, general object detection technology can be divided into the classical two-stage models of the early period and the single-stage YOLO series popular in recent years. However, both show significant defects when transferred directly to detecting prohibited weapons and personnel in security scenes.

In the early stage, multiple object detection models developed in parallel. In 2015, Shaoqing Ren et al. built on R-CNN and Fast R-CNN by introducing the Region Proposal Network (RPN) for efficient candidate region generation, yielding Faster R-CNN [24], the classic two-stage object detection algorithm. However, its two-stage structure inherently limits real-time performance and cross-scale feature fusion. Mask R-CNN [25], proposed by Kaiming He et al. in 2017, combines object detection with instance segmentation. Its core design focuses on instance segmentation, and its object detection performance is mediocre. Zhi Tian et al. proposed FCOS [26] in 2019, a fully convolutional single-stage detector without anchor boxes or region proposals. By eliminating the predefined anchor box set, FCOS avoids the computation associated with anchor boxes. Although lightweight, it adapts poorly to multi-class objects and complex security scenes involving weapon and personnel detection, resulting in poor detection performance. EfficientDet [27], introduced by Mingxing Tan et al. in 2020, builds on EfficientNet and the weighted bidirectional feature pyramid network (BiFPN). However, its multi-scale feature fusion is not optimized for binding semantics with details or for efficient cross-layer fusion, which leaves deficiencies in feature extraction, semantic matching, and scene understanding.

In addition to the methods above, most researchers currently adopt YOLO series models for detection. YOLOv5 [28] has become the preferred solution for industrial detection tasks due to its lightweight design and cost-effectiveness. However, excessive lightweighting can reduce detection accuracy. The local receptive field of its convolutional neural network (CNN) backbone struggles to model long-range dependencies between objects and scene context in security scenarios, and aggressive non-maximum suppression (NMS) often mistakenly eliminates correct bounding boxes in dense crowds. YOLOv6 [29] improves accuracy by optimizing the backbone network and the loss function, but residual anchor box issues and redundant feature fusion remain. YOLOv8 [30] set a benchmark in object detection through cross-scale feature fusion and enhanced semantic extraction, but its box regression is sensitive to limb overlap and posture deformation, making it hard to adapt to crowded and limb-occluded security scenes. Subsequent versions, including YOLOv10 [31], YOLOv11 [32], and YOLOv12 [33], introduce innovations in architecture and training strategy and improve single-class detection and inference efficiency. However, in the synchronous multi-scale detection task studied here, YOLOv10 captures occluded and small weapon features insufficiently and is prone to missing small objects. YOLOv11 and YOLOv12, although strongly competitive in single-class weapon detection, are limited in global context modeling. As a result, their accuracy fluctuates more than that of RT-DETR in multi-category synchronous detection with dense personnel and complex backgrounds, and their cross-scenario generalization falls short.
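
The point about NMS removing correct boxes in dense crowds can be illustrated with a small greedy NMS sketch. This is a generic textbook version, not the post-processing of any particular YOLO release; the boxes, scores, and thresholds are hypothetical values chosen so that two people standing close together overlap with an IoU of about 0.54.

```python
from typing import List, Tuple

# A box is (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float], iou_threshold: float) -> List[int]:
    """Keep boxes in descending score order; drop any box whose IoU with an
    already kept box exceeds iou_threshold. Returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Hypothetical detections: person A, a near-duplicate of A, and person B next to A.
boxes = [(0.0, 0.0, 100.0, 200.0), (2.0, 0.0, 102.0, 200.0), (30.0, 0.0, 130.0, 200.0)]
scores = [0.90, 0.80, 0.85]

print(greedy_nms(boxes, scores, iou_threshold=0.5))  # [0]     person B is lost
print(greedy_nms(boxes, scores, iou_threshold=0.6))  # [0, 2]  duplicate removed, B kept
```

The same rule that removes the duplicate also removes a genuine neighbor once the two people overlap enough, which is the failure mode that NMS-free, end-to-end detectors avoid.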

2.3. The Development of the DETR Model and the Bottleneck in Adapting It to Security-Related Weapon and Personnel Object Detection Scenarios

The DETR framework revolutionized object detection with an end-to-end paradigm, but shortcomings remain when adapting it to violent scenarios. As the first end-to-end object detection transformer framework, DETR [34] extracts features with a convolutional backbone, predicts boxes with an encoder–decoder structure, and matches predicted boxes to ground-truth boxes with the Hungarian algorithm, discarding traditional NMS post-processing. However, it has significant limitations: DETR-R50 reaches a mean average precision (mAP) of only 42.0% on the COCO dataset, and its inference is very slow. DAB-DETR [35] introduces dynamic anchor boxes to improve positioning accuracy; combined with deformable attention, DAB-Deformable-DETR-R50 achieves an mAP of 48.7% on COCO, but the gain in inference efficiency is limited. Deformable-DETR [36] reduces computational cost through sparse attention, and Deformable-DETR-R50 achieves an mAP of 46.2% on COCO, but it still cannot meet real-time detection requirements. DN-DETR [37] accelerates convergence by injecting noise into the query vector during training, but still has problems in feature processing. DINO [38] combines a pre-sorting strategy, enabling DINO-Deformable-DETR-R50 to reach an mAP of 50.9% on COCO. However, its computational cost is significantly higher, and its frame rate of only 5 frames per second makes deployment on edge devices challenging. H-DETR [39] uses a hybrid matching strategy, but its additional branches add a large computational burden.
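
The one-to-one matching between predicted and ground-truth boxes that lets DETR drop NMS can be sketched as follows. This is a simplified illustration under stated assumptions: the cost uses only a class-probability term and an L1 box term (DETR's actual cost also includes a generalized IoU term), the optimum is found by brute force over permutations as a stand-in for the Hungarian algorithm, and all boxes and probabilities are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def match_cost(class_prob: float, pred_box: Sequence[float], gt_box: Sequence[float]) -> float:
    """Simplified cost: low when the predicted probability of the ground-truth
    class is high and the boxes are close in L1 distance."""
    l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box))
    return -class_prob + l1


def best_assignment(cost: List[List[float]]) -> Tuple[int, ...]:
    """cost[g][p] is the cost of giving ground truth g to prediction p.
    Returns, for each ground truth, the index of its matched prediction.
    Brute force; the Hungarian algorithm finds the same optimum in polynomial time."""
    n_gt, n_pred = len(cost), len(cost[0])
    return min(
        permutations(range(n_pred), n_gt),
        key=lambda perm: sum(cost[g][p] for g, p in enumerate(perm)),
    )


# Hypothetical ground truth: (class name, box as normalised cx, cy, w, h).
ground_truth = [("gun", (0.20, 0.20, 0.10, 0.10)), ("person", (0.60, 0.50, 0.30, 0.80))]
# Hypothetical predictions: (class probabilities, box).
predictions = [
    ({"gun": 0.05, "person": 0.90}, (0.61, 0.50, 0.30, 0.80)),
    ({"gun": 0.70, "person": 0.20}, (0.20, 0.21, 0.10, 0.10)),
    ({"gun": 0.10, "person": 0.10}, (0.90, 0.90, 0.10, 0.10)),
]

cost = [[match_cost(probs[cls], box, gt_box) for probs, box in predictions]
        for cls, gt_box in ground_truth]
print(best_assignment(cost))  # (1, 0): gun -> prediction 1, person -> prediction 0
```

Prediction 2 is left unmatched and would be trained toward the "no object" class, so each object ends up with exactly one responsible query and no duplicate suppression step is needed.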

RT-DETR [40] balances real-time performance and accuracy by taking three levels of features from the backbone network and processing them with an intra-scale feature interaction module (AIFI) and a cross-scale feature fusion module (CCFM). Compared with mainstream YOLO models, RT-DETR has relative advantages in synchronous multi-scale detection, feature extraction for occluded objects, and resistance to complex backgrounds, thanks to the global context modeling of the transformer architecture. Its end-to-end paradigm also avoids the precision loss caused by non-maximum suppression (NMS), making it a better match for weapon and personnel detection in security scenarios. YOLO models remain strongly competitive in single-frame inference speed and single-category detection accuracy, so the two families are both complementary and competing in industrial applications.

However, the core architecture of RT-DETR still has three major limitations for weapon and personnel detection in public safety scenarios: (1) The architecture fails to adequately bridge the semantic gaps and positional deviations between hierarchical features, restricting its ability to jointly model small weapons such as knives and pistols together with large personnel objects. (2) The existing convolutional and attention modules adapt poorly to the occlusion and motion blur common in security scenarios. They lack targeted filtering and compensation mechanisms for locally valid information, which reduces feature robustness for incomplete objects. (3) Global modeling in RT-DETR relies mainly on the decoder attention mechanism, and its multi-scale feature processing, fine feature extraction, and feature calibration still need improvement.

2.4. Shortcomings of Existing Research and Positioning of This Paper

Based on the above analysis, the current methods have made significant progress in the field of general object detection and violent behavior recognition, but there are still three core technical bottlenecks in the direction of weapon and person object detection in public security scenarios:

(1) Insufficient multi-scale feature fusion: feature fusion in general object detection relies on fixed-scale weighting (such as EfficientDet [27]) or simple cross-layer concatenation (such as the YOLO series [28,29,30,31,32,33]), which cannot account for both the fine details of small weapons such as knives and pistols and the global characteristics of large personnel objects. This hinders synchronous, high-precision detection of multi-scale security objects.
(2) Insufficient robustness for incomplete objects: existing models rely on a single convolution or sparse attention (such as Deformable-DETR [36]) to extract features from occluded or blurred objects and lack a hierarchical feature compensation mechanism, which leads to serious missed detections and false alarms for occluded weapons and overlapping personnel.
(3) Low object discrimination against complex backgrounds: global semantic modeling mostly relies on a single attention mechanism (such as the RT-DETR decoder [40]) and does not combine collaborative calibration of context and spatial features. In security scenes with dense crowds and cluttered environments, it is difficult to distinguish foreground objects from the background.

Existing methods mostly address these three problems independently. Without a cross-dimensional collaboration mechanism, they cannot meet the engineering requirements of real security scenes, where small weapons coexist with large personnel objects and occlusion compounds complex backgrounds. For example, YOLOv8 [30] only stacks feature dimensions: it does not resolve the discontinuity between high-level semantics and low-level details, nor does it address feature compensation under occlusion. Deformable-DETR [36] is optimized only for occlusion: it does not use multi-scale semantic enhancement to distinguish occluded weapons from ordinary background objects, nor does it consider background interference in complex security scenes. RT-DETR [40] relies only on high-level features: it does not calibrate the spatial deviation of multi-scale features, nor does it compensate for the effect of occlusion-induced incomplete features on object–background discrimination.

To address these bottlenecks, this paper focuses on end-to-end, high-precision detection and localization of security-related objects such as weapons and personnel in static security images, and proposes a dedicated object detection model, PDGV-DETR. The following sections describe the overall architecture and the design principles of the three core modules, verify the model through multi-dataset comparison and ablation experiments, and analyze its engineering value in real security scenarios.

3. Method

To address the three core limitations of general object detection models in weapon and personnel detection (insufficient multi-scale fusion, poor robustness to incomplete objects, and weak object–background discrimination), this study proposes the PDGV-DETR detection framework, whose overall architecture is shown in Figure 1. The model adopts an end-to-end design and consists of three core units: the backbone network (Backbone), the encoder (Encoder), and the decoder (Decoder). The input is a single static frame from security surveillance, and the output is the category probability, pixel-level bounding box, and IoU confidence for three types of objects (guns, knives, and people), with no behavior-level classification output. The core task is the detection and positioning of security threat objects.

The backbone network performs multi-scale feature extraction on the input image and generates three feature maps: $[eqn]$ , $[eqn]$ and $[eqn]$ . The encoder achieves cross-scale feature fusion through the bidirectional mixed feature pyramid (DWH-FPN) and conducts context modeling using the global semantic weaving and elastic feature alignment network (GSWFN). The decoder integrates multi-scale feature outputs through deformable attention mechanisms to obtain the object category, bounding box, and IoU confidence.

PDGV-DETR improves performance through a closed-loop design of three modules. The bidirectional hybrid feature pyramid network (DWH-FPN) addresses multi-scale semantic discontinuity: through bidirectional interaction based on transposed convolution and channel attention, it tightly binds high-level and low-level features. The dynamic hierarchical channel interactive convolution module (BasicPCock) addresses incomplete feature compensation: through the collaboration of fine convolution in the main path and adaptive mapping in the shortcut path, it repairs feature breaks caused by occlusion. The global semantic weaving and elastic feature alignment network (GSWFN) addresses object–background calibration: through multi-scale context modeling and spatial elastic alignment, it suppresses the interference of complex backgrounds on object recognition.
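
To make the idea of channel-attention-weighted cross-scale fusion concrete, the sketch below fuses one high-level map into one low-level map and reweights the channels. It is only an illustration of the general technique and not the DWH-FPN design: nearest-neighbour upsampling stands in for a learned transposed convolution, the attention is a plain squeeze-and-excitation style gate, only the top-down direction is shown, and the function names, shapes, and random weights are all assumptions.

```python
import numpy as np


def upsample2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upsampling of a (C, H, W) map.
    Assumption: stands in for a learned transposed convolution."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def channel_attention(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Squeeze-and-excitation style gate: global average pool, linear map, sigmoid.
    Returns one weight in (0, 1) per channel."""
    squeezed = x.mean(axis=(1, 2))                    # (C,)
    return 1.0 / (1.0 + np.exp(-(w @ squeezed)))      # (C,)


def fuse(low: np.ndarray, high: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Add upsampled high-level semantics to low-level detail, then reweight channels."""
    merged = low + upsample2x(high)
    weights = channel_attention(merged, w)
    return merged * weights[:, None, None]


rng = np.random.default_rng(0)
low = rng.standard_normal((8, 16, 16))    # hypothetical low-level map (fine detail)
high = rng.standard_normal((8, 8, 8))     # hypothetical high-level map (semantics)
w = rng.standard_normal((8, 8)) * 0.1     # hypothetical attention weights
fused = fuse(low, high, w)
print(fused.shape)  # (8, 16, 16)
```

Because the gate depends on the content of the merged map, the per-channel weights change from image to image, which is what distinguishes this kind of fusion from fixed-weight concatenation.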

The three modules are not optimized independently but form a closed loop of “multi-scale feature fusion → incomplete feature repair → background interference filtering”. The multi-scale features output by DWH-FPN provide the semantic basis for incomplete feature compensation in BasicPCock; the features repaired by BasicPCock serve as noise-reduced input for object–background calibration in GSWFN; and the features calibrated by GSWFN are fed to the decoder to improve detection accuracy. Together they provide a collaborative solution to the three major pain points of weapon and personnel detection in security scenarios: multi-scale objects, occlusion, and complex backgrounds. This systematic approach differs from single-module improvement and better matches the complexity and uncertainty of security scenarios, providing technical support for the engineering deployment of security threat object detection.

3.1. Backbone Network

As shown in Figure 1, the backbone network extracts the initial features of the input image. The image first passes through three convolutional layers and a max pooling layer, generating feature map $[eqn]$ . $[eqn]$ then passes through four BasicBlock feature extraction layers to generate multi-scale feature map $[eqn]$ . After processing by the BasicPCock module, feature maps $[eqn]$ and $[eqn]$ are obtained. The BasicPCock module adapted in this paper is a core structural enhancement: it extends the original BasicBlock module with additional convolutional layers and integrates a dynamic convolution strategy, which strengthens the feature representation of $[eqn]$ and $[eqn]$ . The detailed structure is shown in Figure 2.

Here, let the input feature map be denoted as $[eqn]$ , where $[eqn]$ , H, and W represent the number of input channels, the spatial height, and the spatial width, respectively, while $[eqn]$ represents the feature map output after processing by the BasicPCock module. After entering the BasicPCock module, $[eqn]$ undergoes dual-path parallel processing.

After $[eqn]$ enters the main feature extraction path (Main path), it extracts spatial features through a 3 × 3 convolution layer. After batch normalization (BatchNorm2d) and ReLU activation, the intermediate features are obtained, denoted as $[eqn]$ . Its mathematical expression is given by Formula (1).

[eqn]

Here, BatchNorm2d() represents channel normalization, and ReLU is the activation function.

$[eqn]$ is divided into the first quarter and the last three quarters by channel dimension. Only the first quarter is subjected to a 3 × 3 convolution operation to obtain $[eqn]$ , while the last three quarters are directly mapped to $[eqn]$ . After the features of $[eqn]$ and $[eqn]$ are concatenated, batch normalization and ReLU activation are carried out to obtain the main path output $[eqn]$ , which is mathematically represented by Formula (2).

[eqn]

Here, cat represents channel concatenation, BatchNorm2d() represents channel normalization, and ReLU represents the activation function.
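The saving from convolving only the first quarter of the channels can be illustrated with a rough operation count. The sketch below assumes that the partial 3 × 3 convolution maps C/4 input channels to C/4 output channels and counts one multiply-accumulate per kernel tap. The function names and the 40 × 40 × 256 feature size are illustrative assumptions, not values taken from the paper.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """Multiply-accumulate count of a dense convolution (bias ignored)."""
    return height * width * c_in * c_out * kernel * kernel


def partial_conv_macs(height, width, channels, kernel, fraction=4):
    """Cost when only the first 1/fraction of the channels is convolved.

    Assumption: the convolved slice keeps its channel count (C/4 -> C/4),
    so that concatenating it with the untouched 3C/4 channels restores C.
    """
    c_part = channels // fraction
    return conv_macs(height, width, c_part, c_part, kernel)


# Hypothetical feature map size, chosen only for illustration.
full_cost = conv_macs(40, 40, 256, 256, 3)
partial_cost = partial_conv_macs(40, 40, 256, 3)
ratio = partial_cost / full_cost
print(ratio)
```

Under these assumptions the partial convolution costs 1/16 of a full 3 × 3 convolution over all channels, because both the input and the output channel counts shrink by a factor of four.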

After the $[eqn]$ enters the shortcut path, different mapping operations are performed based on parameters such as step size to obtain the shortcut path output $[eqn]$ (i = 1, 2, 3). The specific judgment method is:

I. When the input and output channel counts are equal and the step size (stride) is 1, the shortcut flag is true; identity mapping is performed and the input is passed through directly. We represent it as $[eqn]$ , and its mathematical representation is Formula (3).

[eqn]

II. When the step size is 2 or the input and output channel counts are unequal, the shortcut flag is false. In this case, the shortcut path is constructed according to the variant value and the step size. The variant parameter takes the default value ‘d’ unless it is explicitly specified otherwise. When the step size = 2 and variant = ‘d’, downsampling is carried out through average pooling, and then the number of channels is adjusted by 1 × 1 convolution. We represent it as $[eqn]$ , which is mathematically expressed as Formula (4).

[eqn]

Here, AvgPool2d is average pooling, and $[eqn]$ is a 1 × 1 convolution operation.

Otherwise, the number of channels is directly adjusted through 1 × 1 convolution without downsampling. We represent it as $[eqn]$ . Its mathematical expression is Formula (5).

[eqn]
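The three shortcut cases can be summarized as a small decision function. This is a minimal sketch of the selection logic as described in the text; the function names and the operation labels are hypothetical, and the pooling kernel size is deliberately left unspecified because the text does not state it.

```python
def select_shortcut(in_channels, out_channels, stride, variant="d"):
    """Return the ordered operations applied on the shortcut path."""
    if in_channels == out_channels and stride == 1:
        # Case I: channels match and no downsampling -> identity mapping.
        return ["identity"]
    if stride == 2 and variant == "d":
        # Case II with variant 'd': average pooling, then 1x1 convolution.
        return ["avg_pool", "conv_1x1"]
    # Remaining case: only the channel count is adjusted by 1x1 convolution.
    return ["conv_1x1"]


def add_and_relu(main_out, shortcut_out):
    """Element-wise sum of both paths followed by ReLU, on flat lists."""
    return [max(0.0, m + s) for m, s in zip(main_out, shortcut_out)]


print(select_shortcut(64, 64, 1))    # identity shortcut
print(select_shortcut(64, 128, 2))   # pooled shortcut
print(add_and_relu([1.0, -2.0], [0.5, 1.0]))
```

The second helper mirrors the final step of the module, where the shortcut output is added to the main path output and passed through ReLU.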

The output feature $[eqn]$ (i = 1, 2, 3) of the shortcut path is added to the main path feature $[eqn]$ by element, and the nonlinear expression is enhanced by the ReLU activation function to generate the output $[eqn]$ . Its mathematical representation is Formula (6).

[eqn]

In the backbone network, $[eqn]$ is passed into the subsequent DWH-FPN (feature pyramid network) as one of the multi-scale outputs and is simultaneously fed into the first BasicPCock module. Since this module’s step size is 2 and its input and output channels do not match, the variant = ‘d’ shortcut is used: the shortcut path performs average pooling and 1 × 1 convolution to expand the channels, and the output is $[eqn]$ . The second BasicPCock module, whose channels match and whose step size is 1, uses an identity shortcut path to convert $[eqn]$ to $[eqn]$ . The same pattern is then applied to $[eqn]$ , and $[eqn]$ and $[eqn]$ are generated successively through two BasicPCock modules.

When the BasicPCock module was used to replace the BasicBlock at different positions and in different quantities in the original backbone network, replacing the two network modules in front of $[eqn]$ and $[eqn]$ gave the best results. From the perspective of the network structure, the resulting architecture integrates the multi-level characteristics of BasicPCock and BasicBlock, enhances the feature expression ability of $[eqn]$ and $[eqn]$ , and is one of the important factors in improving the model’s robustness. The dual-path design of BasicPCock combines partial convolution with an adaptive shortcut mechanism, which differs from traditional dynamic convolution that focuses on the dynamics of a single dimension only (either spatial or channel). BasicPCock coordinates the two dimensions, preserving spatial detail through the fine convolution in the main path and compressing channel redundancy through the adaptive channel mapping in the shortcut path. For the incomplete features of prohibited weapons in occluded scenes in particular, channel-level compensation reduces feature breakage, and dynamic channel interaction is achieved during downsampling while spatial and semantic consistency are maintained. The introduction of partial convolution reduces computational redundancy and improves inference efficiency while preserving feature quality. For incomplete weapon and person objects (such as occluded knives and partially visible human limbs), the module helps avoid missed detections caused by feature breakage and improves the completeness of weapon and person object recognition.

3.2. Encoder

After entering the encoder, $[eqn]$ is processed and then sent to the bidirectional hybrid feature pyramid (DWH-FPN) network for encoding operations. The encoder is one of the core components of the PDGV-DETR model. Its main function is to achieve cross-scale feature fusion and enhance semantic representation capabilities. The bidirectional hybrid feature pyramid (DWH-FPN) and the global semantic weaving and elastic feature alignment network (GSWFN) in its structure enhance the ability of multi-scale fusion and solve the problems of information loss or misalignment that are prone to occur in traditional methods when fusing multi-scale features.

3.2.1. Bidirectional Hybrid Feature Pyramid (DWH-FPN)

A feature pyramid network (FPN) is a deep learning structure specifically designed to handle multi-scale object features. Among existing designs, the classical BiFPN [27] realizes cross-layer feature interaction through weighted fusion, but the weights are fixed, and HS-FPN only supports top-down unidirectional semantic delivery [41]. With reference to the pyramid structure in Deformable-DETR [41], we adapted a bidirectional hybrid feature pyramid (DWH-FPN) in the encoder of this model. DWH-FPN combines channel attention weighting with bidirectional mapping based on transposed convolution, which retains edge details during upsampling and suppresses background noise during downsampling. The specific structure is shown in Figure 3.

After the processed $[eqn]$ features and $[eqn]$ , $[eqn]$ enter the DWH-FPN module, the processed $[eqn]$ , after being weighted by channel attention (ChannelAttention_HSFPN) and 1 × 1 convolution, is upsampled by transposed convolution by a factor of 2 to obtain feature $[eqn]$ . It is then combined with the feature $[eqn]$ obtained by processing $[eqn]$ through channel attention and 1 × 1 convolution, fused by multiplying the attention weights and adding residuals, and enhanced by the RepC3 module to generate feature $[eqn]$ . Its mathematical representation is Formula (7).

[eqn]

The channel attention module (CA) dynamically generates a channel-wise weight vector $[eqn]$ . During feature fusion, this vector is broadcast over the spatial dimensions and multiplied element-wise ( $[eqn]$ ) with the spatial feature map $[eqn]$ . In the formula, + represents residual addition, and RepC3 represents the operation of the RepC3 module.
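The broadcasting step can be made concrete with plain nested lists. The sketch below only illustrates how a per-channel weight is spread over every spatial position and how a residual is then added; which feature is weighted and which is added as the residual is an assumption here, the helper names are hypothetical, and the RepC3 enhancement is omitted.

```python
def apply_channel_weights(feature, weights):
    """Multiply a [C][H][W] feature by a length-C weight vector.

    Each channel weight is broadcast over all H x W positions of its channel.
    """
    return [
        [[w * value for value in row] for row in plane]
        for plane, w in zip(feature, weights)
    ]


def add_residual(feature, residual):
    """Element-wise addition of two [C][H][W] features of equal shape."""
    return [
        [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(pa, pb)]
        for pa, pb in zip(feature, residual)
    ]


# Toy feature with 2 channels and 2 x 2 spatial size (illustrative values).
low_level = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
upsampled = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
channel_weights = [0.5, 2.0]

weighted = apply_channel_weights(low_level, channel_weights)
fused = add_residual(weighted, upsampled)
print(fused)
```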

$[eqn]$ undergoes transposed convolutional upsampling to obtain feature C4, and this is combined with Y3 obtained by channel attention and Conv1 × 1 dimensionality increase of P3. They are fused through multiplication by the attention weights and the addition of residuals. Then, the feature is enhanced by the RepC3 module to generate feature $[eqn]$ . Its mathematical representation is given by Formula (8).

[eqn]

Here, $[eqn]$ represents element-wise multiplication, + represents residual addition, and RepC3 represents the operation of the RepC3 module.

$[eqn]$ undergoes 3 × 3 convolution and 2 times downsampling to obtain feature $[eqn]$ . Feature $[eqn]$ is fused with $[eqn]$ generated by the top-down path by multiplying by the attention weights and adding with residuals. Then, it is enhanced by the RepC3 module to obtain feature $[eqn]$ . Its mathematical expression is Formula (9).

[eqn]

$[eqn]$ continues to undergo downsampling and is fused with the initial $[eqn]$ feature (weighted by channel attention) through the same mechanism to obtain feature $[eqn]$ . Its mathematical representation is given by Formula (10).

[eqn]

For the identification of prohibited weapon objects, the bidirectional feature interaction of DWH-FPN can deeply integrate the weapon textures and contour details in the low-level features with the “dangerous device” category information in the high-level semantics; for the positioning of personnel objects, the channel attention mechanism can prioritize the activation of the key feature channels of the personnel contours and limb areas, quickly stripping away the invalid background interference in complex scenes; and for the understanding of security scenarios, the efficient cross-layer fusion can retain the overall features of the scene environment (such as indoor or street) and the local details where the object is located, providing high-quality feature support for the judgment of the association between the object and the scene.

3.2.2. Global Semantic Weaving and Elastic Feature Alignment Network (GSWFN)

The feature maps $[eqn]$ , $[eqn]$ , and $[eqn]$ are processed by DWH-FPN, among which $[eqn]$ serves as the intermediate-scale feature to balance semantics and details and directly participates in decoding, while $[eqn]$ and $[eqn]$ enter the global semantic weaving and elastic feature alignment network, which is composed of the components of the context and spatial feature calibration network [42]. Through multi-scale context capture and attention weighting, the semantic correlation between violent objects and scene backgrounds is strengthened, avoiding the loss of details or insufficient global information caused by a single-scale context. The core improvement lies in clarifying the specific parameters and mechanism of “multi-scale pooling”. The specific structure is shown in Figure 4.

After $[eqn]$ is input, it first enters the CFC_CRB module, where a 3 × 3 convolution is used for channel compression and four grid-scale pooling operations are executed in parallel to capture context information over different ranges. Then, the dual-branch attention mechanism produces the key, value, and query vectors. The key and query vectors are combined and normalized by Softmax, and the resulting weights are used to fuse and reconstruct the value vectors. The processed $[eqn]$ features are output through a 3 × 3 convolution combined with the Tanh activation function. Its mathematical representation is Formula (11).

[eqn]

Here, $[eqn]$ , $[eqn]$ , and $[eqn]$ represent the query, key, and value matrices, respectively, where N is the flattened spatial resolution of the pooled grids, and $[eqn]$ is the channel embedding dimension. Tanh represents the activation function, $[eqn]$ represents the features processed by the CFC_CRB module, and $\mathrm{Conv}_{3\times 3}$ represents the 3 × 3 convolution operation.

After $[eqn]$ and $[eqn]$ enter the spatial feature calibration module (SFC_G2), a two-path feature transformation is performed. $[eqn]$ maintains the number of channels through a 3 × 3 convolution operation to obtain $[eqn]$ . $[eqn]$ first expands the number of channels through 3 × 3 convolution and then undergoes bilinear upsampling to obtain $[eqn]$ . Subsequently, the features are concatenated, and an offset tensor is generated through a convolutional layer. A sampling grid is generated for the high-level feature $[eqn]$ to compensate for the spatial position deviation caused by downsampling, and grid sampling is carried out on the low-level feature $[eqn]$ to align it with the semantic space of the high-level feature. After that, dynamic fusion is performed. The enhanced multi-scale feature map $[eqn]$ is given by Formula (12).

[eqn]

Here, $[eqn]$ represents the final output, and $[eqn]$ and $[eqn]$ denote the fusion weights.

In the association recognition of objects and environments in security scenarios, global semantic weaving can deeply associate “weapons such as knives and guns” with “scenarios such as streets and indoor enclosed spaces”, thereby reducing the false detection probability in complex scenarios. For the problem of ambiguous object and background features (such as small weapons in chaotic environments or overlapping personnel objects in crowds), the spatial calibration capabilities of the CFC_CRB and SFC_G2 modules can significantly enhance the distinction between object and background features, effectively reducing the false positive rate of weapon and personnel object recognition. More importantly, the global-scale feature association enhances the model’s adaptability to different security scenarios (from open squares to enclosed indoor spaces), laying the foundation for its stable application in diverse security scenarios.

It should be noted that this module can only optimize the confusion problem of object–background features in a single frame image, and cannot completely eliminate the visual feature confusion between different behavioral categories of personnel. The relevant quantitative analysis is detailed in Section 4.5.1.

3.3. Decoder

The decoder receives multi-scale feature maps $[eqn]$ (from the GSWFN module), $[eqn]$ (from the DWH-FPN) and $[eqn]$ (from the CFC_CRB module) from the encoder, as shown in the Decoder module in Figure 1. These features are first unified by 1 × 1 convolution to generate corresponding feature maps $[eqn]$ , $[eqn]$ and $[eqn]$ . The mathematical representation is given by Formula (13).

[eqn]

Here, $[eqn]$ , $[eqn]$ , $[eqn]$ are the feature maps after the unified channel count. $[eqn]$ represents a 1 × 1 convolution operation, which is used to uniformly adjust the number of channels of the input features to 256.

The unified features are used as the key-value pairs of deformable cross attention and interact with 300 query vectors Q. In each layer of decoding, the query vectors are first updated through self-attention (MultiHeadAttn) to generate the updated query state $[eqn]$ after self-attention.

Then, they are dynamically sampled from the multi-scale feature maps $[eqn]$ , $[eqn]$ and $[eqn]$ , and a new query state $[eqn]$ is generated through the feedforward network. Its mathematical expression is given by Formula (14).

[eqn]

Here, $[eqn]$ is the updated query vector after self-attention; MSDeformAttn is the multi-scale deformable attention mechanism, which enables the decoder to dynamically sample key information from multiple scales of feature maps; FFN is a feedforward neural network, usually consisting of two fully connected layers, used to enhance the nonlinear expression ability of features; and $[eqn]$ is the query state after the first layer of the decoder processing.

After three iterations, the final query vector generates three types of outputs:

The linear layer predicts the probability of each of the n object classes, which is mathematically expressed as Formula (15).

[eqn]

Here, $[eqn]$ is a linear layer (fully connected layer), which maps the 256-dimensional query vector to an n-dimensional vector, where n is the number of dataset categories. cls is the predicted category probability vector.

The normalized bounding box coordinates box are output by the MLP, and their mathematical representation is Formula (16).

[eqn]

Here, $[eqn]$ is a multi-layer perceptron that maps the 256-dimensional query vector to the 4-dimensional bounding box coordinates (center point coordinates x, y, width w, height h), and Sigmoid is the activation function, which normalizes the coordinates to the range of [0, 1].

The third output independently predicts the IoU confidence iou, and its mathematical expression is Formula (17).

[eqn]

Here, $[eqn]$ is a linear layer that outputs a 1-dimensional confidence value.

The final detection result is achieved through the screening of the top 300 high-confidence bounding boxes of σ(cls) × σ(iou), enabling multi-scale object detection and forming an end-to-end link from features to results.
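The post-processing described above can be sketched as follows. The code assumes that each query contributes its highest-scoring class, that the final score is the product of the sigmoid of the class logit and the sigmoid of the IoU logit, and that boxes are converted from normalized (cx, cy, w, h) to pixel corner coordinates. All function names and the toy inputs are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def select_detections(cls_logits, iou_logits, top_k=300):
    """Rank queries by sigmoid(cls) * sigmoid(iou) and keep the top_k.

    cls_logits: one list of per-class logits per query.
    iou_logits: one IoU-confidence logit per query.
    Returns (score, query_index, class_index) tuples, best first.
    """
    scored = []
    for query, (logits, iou) in enumerate(zip(cls_logits, iou_logits)):
        best_class = max(range(len(logits)), key=lambda c: logits[c])
        score = sigmoid(logits[best_class]) * sigmoid(iou)
        scored.append((score, query, best_class))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    )


# Toy example with three queries and two classes.
kept = select_detections(
    [[0.0, 2.0], [3.0, -1.0], [-2.0, -3.0]], [0.0, 1.0, 2.0], top_k=2
)
print([(query, cls) for _, query, cls in kept])
print(cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.25), 640, 480))
```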

4. Experiments and Datasets

4.1. Experiment Settings

During the experiment, the implementation language of the method was Python 3.9, and the CUDA version was 11.7. The cloud server was equipped with 18 vCPUs (based on a 128-core AMD EPYC 9754 physical processor) and a single RTX 3090 (24 GB) GPU card.

The images were labeled in the YOLO format, and the dataset was divided into training, validation, and test sets at a ratio of 7:1:2. To ensure a fair performance evaluation of all comparison models and to keep the experimental variables controlled, the baseline experiment of this study did not use any additional image augmentation techniques to expand the training dataset. All models were trained, validated, and tested on the original RGB images of the dataset, so the experimental results directly reflect the native detection performance of each model on original security monitoring images.
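A 7:1:2 split of this kind can be reproduced with a seeded shuffle. The following is a minimal sketch rather than the authors' actual script; the function name, the seed, and the rule that rounding remainders go to the test set are assumptions.

```python
import random


def split_dataset(items, ratios=(7, 1, 2), seed=0):
    """Shuffle items deterministically and split them by integer ratios.

    Any remainder from integer division ends up in the last (test) subset.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical image identifiers, used only to show the subset sizes.
train_ids, val_ids, test_ids = split_dataset(range(100))
print(len(train_ids), len(val_ids), len(test_ids))
```

Fixing the seed keeps the three subsets identical across all compared models, which matters when every model is to be evaluated on the same test images.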

4.2. Data Set

The selected dataset is the Violence-Image-Dataset, corresponding to paper [43], as well as the handgun dataset and knife dataset from the OD-WeaponDetection dataset, corresponding to papers [44,45], respectively. The People dataset [46] was also used for further experiments.

The Violence-Image-Dataset is a skeleton-based violent image detection dataset created by Zhang Peng. It contains violent monitoring content of multiple people in various scenarios, including indoors, outdoors, and in subways, banks, stores, roads, etc. The dataset includes RGB images and skeleton point images. There are a total of 3474 images in the dataset, with 2421 RGB images and 1053 skeleton images. This dataset is derived from online data, self-generated data, and data selected from public datasets, and is mainly used for violent image detection based on skeleton and RGB.

This study chose RGB images for the experiments. It must be explicitly clarified that although the dataset is named “Violence-Image-Dataset”, this study strictly utilizes only its bounding box annotations for spatial object detection. No temporal behavior labels, action sequences, or event classification annotations were included or used in the training or evaluation of our model. The original dataset provides labels such as “fighting” and “normal”. To prevent any misconception regarding the scope of our research, these labels are strictly treated as static object categories representing specific instantaneous body postures, rather than dynamic behavior attributes. These posture-based bounding box labels serve solely as the training and testing ground truth for the model’s personnel object localization. The standard for the “fighting” label is as follows. If the individuals in the image meet any of the following conditions based on the visual features of a single frame static image, they can be classified into this category: ① there is an antagonistic physical contact (such as hitting, tearing, choking, etc., violent actions) between two or more individuals; ② the individuals are holding prohibited weapons such as guns or knives, and their postures have a clear aggressive direction; or ③ the individuals have tense limbs and an attack-ready posture, and the scene has clear conflict environmental characteristics. The standard for the “normal” label is as follows. If the individuals in the image do not have the above violence-related behavioral characteristics, they only exhibit regular behaviors such as walking, sitting still, hugging, talking, and sports activities, without aggressive postures or holding prohibited weapons, and can be classified into this category.

It should be particularly noted that this model is based on the above labeling standards to complete the end-to-end detection and classification of the four types of objects in single-frame static images: “guns”, “knives”, “violent behavior individuals”, and “non-violent regular behavior individuals”. The labels are involved throughout the model training and inference output and are an important part of the core detection task of the model.

The OD-WeaponDetection dataset was constructed by Alcasla Alberto Castillo Lamas and others. It contains a gun dataset and a knife dataset, whose images are derived from movies or surveillance videos. The gun dataset has 3000 items, and the knife dataset has 2078 items. This study conducted corresponding experiments on them separately. In the sample figure, the first two images show the pistol dataset, which includes pistols in various scenarios as well as pictures of pistols from the internet, and the last three images show the knife dataset, which includes pictures of knives used in various scenarios.

4.3. Evaluation Index

The evaluation indicators used in this study mainly include precision (Precision), recall (Recall), F1 score, mean average precision (mAP), and FLOPs. In different detection scenarios, the classification criteria for the detection results are as follows:

In the violence detection task, a true positive (TP) is a violent-behavior sample that is correctly identified; a false positive (FP) is a non-violent sample that is mistakenly classified as violent; and a false negative (FN) is a violent-behavior instance that is not detected. For the knife and gun detection scenario, TP corresponds to correctly identified target object instances, FP refers to non-target objects wrongly classified as target objects, and FN refers to target object instances that are missed. Based on these counts, the core evaluation parameters of the experiment can be derived. Precision is defined as the proportion of true positives among the samples the model predicts as positive, reflecting the reliability of its positive predictions. Recall is the proportion of actual positives that the model correctly predicts, measuring its ability to capture positive examples. Their mathematical representations are given by Formulas (18) and (19).

[eqn]

[eqn]

After obtaining Precision and Recall, the F1 score can be calculated as their harmonic mean. The F1 score ranges from 0 to 1, and the closer the value is to 1, the better the balance between the Precision and Recall of the model. As a core metric for evaluating classification performance, it combines Precision and Recall into a single value that reflects the comprehensive performance of the model. Its mathematical representation is Formula (20).

[eqn]
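The three metrics follow directly from the TP, FP, and FN counts. The sketch below uses the standard definitions (Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 as their harmonic mean); the function name, the zero-division convention, and the example counts are illustrative assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from detection counts.

    Returns 0.0 for a metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 correct detections, 20 false alarms, 20 misses.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(precision, recall, f1)
```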

After calculating the above indicators, the mAP value can be calculated. Two variants are reported, mAP50 and mAP50-95. The more important one is mAP50, which is the mean of the per-category average precision when the IoU threshold is 0.5. mAP50-95 is the mAP averaged over multiple IoU thresholds: it takes 10 IoU thresholds within the range [0.5, 0.95] with a step size of 0.05, calculates the mAP for each of them, and then takes the average. The mathematical expression of mAP50 is Formula (21).

[eqn]
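The IoU thresholds behind these two variants can be made concrete with a short sketch. It computes the IoU of two corner-format boxes, builds the ten thresholds from 0.5 to 0.95, and averages a list of per-threshold mAP values. The helper names are hypothetical, and the per-threshold mAP values themselves would come from a full evaluation that is not reproduced here.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten IoU thresholds: 0.50, 0.55, ..., 0.95.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map50_95(map_per_threshold):
    """Average the mAP values obtained at each of the ten thresholds."""
    return sum(map_per_threshold) / len(map_per_threshold)


overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
print(overlap, iou_thresholds)
```

A prediction counts as a true positive at a given threshold only if its IoU with a ground-truth box of the same class reaches that threshold, so the same prediction can be a hit at 0.5 and a miss at 0.75.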

FLOPs refers to the total number of floating-point operations, which is used to quantify the computational load during the model inference stage and reflects the computational resource consumption during the model’s operation. Its core significance lies in the fact that the smaller the value, the lower the computational cost of the model, and it may have better operational efficiency on hardware devices. The floating-point operation volume in the convolution layer is mathematically expressed as Formula (22).

[eqn]

Here, H and W represent the dimensions of the feature map, $[eqn]$ is the number of input channels, $[eqn]$ is the number of output channels, and K is the size of the convolution kernel.

The mathematical representation of the floating-point operation volume in the fully connected layer is given by Formula (23).

[eqn]

Here, $[eqn]$ represents the number of input neurons, $[eqn]$ represents the number of output neurons, and the total FLOPs is obtained by summing up the computational amounts of all layers. The mathematical expression for this is Formula (24).

[eqn]
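The layer-wise accumulation can be sketched with a simple counter. The code assumes the common convention of one operation per multiply-accumulate, with bias terms ignored and H and W taken as the output feature map size; other conventions differ by a constant factor of about two. The layer list is hypothetical and is not the PDGV-DETR architecture.

```python
def conv_flops(height, width, c_in, c_out, kernel):
    """Operation count of one convolution layer (bias ignored)."""
    return height * width * c_in * c_out * kernel * kernel


def fc_flops(n_in, n_out):
    """Operation count of one fully connected layer (bias ignored)."""
    return n_in * n_out


def total_flops(layers):
    """Sum the counts of all layers given as ("conv", ...) or ("fc", ...)."""
    total = 0
    for kind, *params in layers:
        if kind == "conv":
            total += conv_flops(*params)
        elif kind == "fc":
            total += fc_flops(*params)
        else:
            raise ValueError(f"unknown layer type: {kind}")
    return total


# Hypothetical two-layer example: one 3x3 convolution and one linear head.
example_layers = [("conv", 80, 80, 64, 128, 3), ("fc", 256, 4)]
print(total_flops(example_layers))
```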

4.4. Contrast Experiment

To verify the cross-scenario generalization ability, category recognition accuracy, and spatial positioning accuracy of the proposed model in security-related object detection and positioning tasks, we conducted comparative experiments on the Violence-Image-Dataset, OD-WeaponDetection (including the pistol and knife datasets), and People datasets. Strictly speaking, the comparison concerns the localization accuracy and category recognition accuracy of security-related objects. Within the same task category, the YOLO series models, some DETR series models, other common object detection models, the RT-DETR baseline model, and the proposed PDGV-DETR were trained and tested under a unified experimental configuration, and all models used the same training hyperparameters. The four datasets selected in this study cover the three core security-related objects (people, guns, and knives) as well as typical security monitoring scenarios such as conflict scenes, dense crowds, and complex indoor and outdoor environments. Performance verification across these datasets therefore supports the cross-scenario generalization evaluation of the model in the security object detection task. The comparison results are shown in Table 1.

Table 1 shows that on the conflict scenario personnel detection dataset “Violence-Image-Dataset”, the proposed model achieved the best mAP50 of 85.9%, which was 1.2 percentage points higher than the baseline model RT-DETR (84.7%). The main reason is that the “bidirectional feature interaction” of DWH-FPN strengthened the binding between the personnel’s body details and the semantic categories, avoiding the semantic discontinuity caused by the simple concatenation of cross-layer features in the YOLO series. At the same time, the GSWFN module effectively reduced the interference from complex backgrounds. By contrast, Deformable-DETR, EfficientDet, Faster R-CNN, and Mask-RCNN all encountered performance bottlenecks, with mAP50 remaining at around 70%. YOLOv10 and FCOS also performed poorly on this dataset, with mAP50 of only 78.9% and 71.2%, respectively.

In the Pistol dataset experiment, DINO achieved the best result of 88.6% with the global attention mechanism, and our model followed closely with an accuracy of 86.7%; Deformable-DETR and EfficientDet still had performance bottlenecks and performed poorly, but Faster R-CNN performed relatively well in the gun recognition task.

In the Knife dataset experiment, Mask-RCNN led with 94.9% performance, but had a high occlusion scene missed detection rate and slow inference speed; our model had an accuracy of 93.0%, and relied on the dynamic hierarchical channel interaction convolution module, significantly enhancing the robustness to incomplete objects, effectively avoiding the missed detection problem of occluded weapons, and having more advantages in practical security scenarios. In addition, YOLOv10 and YOLOv12 performed poorly in this dataset, and Deformable-DETR performed well.

In the People dataset experiment, our model led with an accuracy of 82.4%, the core reason being that the elastic feature alignment mechanism of the GSWFN module solved the “human contour overlap” problem, while the accuracy of EfficientDet, Faster R-CNN, and Mask-RCNN was unsatisfactory, and Deformable-DETR performed well. However, Faster R-CNN was limited by its two-stage detection structure and had insufficient efficiency in cross-scale feature fusion, resulting in poor performance in recognizing violent behaviors related to people, with an accuracy of only 65.1% in the Violence-Image-Dataset and only 64.5% in the Pistol dataset.

It is important to note that the performance advantage of RT-DETR over the YOLO series in this study is strictly limited to the specific task of “person, gun, and knife object detection and localization” in static images of security scenarios, rather than the absolute superiority in general object detection across all scenarios. In scenarios such as the general COCO dataset, single-category fast detection, and lightweight deployment on extreme edge devices, the YOLO series models still maintain strong competitiveness, forming a long-term technical complementation and competition relationship with RT-DETR, and there is no comprehensive overwhelming advantage of one architecture over the other. The core optimization of this study is based on the relative adaptability of RT-DETR in the security object detection task, further filling its technical gaps in multi-scale fusion, robustness of occluded objects, and discrimination in complex backgrounds.

The model in this study achieved the best results on two person object detection datasets, the Violence-Image-Dataset and the People dataset. Although the DINO model has a temporarily higher accuracy in single-category gun detection tasks, and the Mask-RCNN model has a temporarily higher accuracy in single-category knife detection tasks, the improvements of PDGV-DETR are centered around the core engineering requirements of security-related object synchronous detection in security scenarios, robustness of occluded objects, and anti-interference in complex backgrounds. It has systematic optimization advantages and exhibits robust and consistent performance in security-related object detection and localization in security scenarios, maintaining stable and excellent comprehensive performance in various security sub-scenarios such as people, guns, and knives. Compared to other general object detection models, which have significant fluctuations in accuracy across different datasets, this model has stronger cross-scenario generalization ability, effectively solving core pain points such as “missed detection”, “misjudgment”, and “poor scene adaptability” in actual security applications, and is a better solution for security-related object detection and localization tasks in security scenarios, such as weapons and personnel.

To verify the statistical significance of the performance improvement and to rule out experimental variance caused by random seeds, five different random seeds were set on the Violence-Image-Dataset, and repeated experiments were conducted on the baseline RT-DETR and the proposed PDGV-DETR. The results are shown in Table 2.

On the conflict scene personnel detection dataset “Violence-Image-Dataset”, the optimal mAP50 of the model in this study reached 0.864 in a single experiment, while the baseline experiment’s mAP50 was 0.859. Across the five seeded runs, the mAP50 of PDGV-DETR, reported as mean ± standard deviation, was 0.858 ± 0.004, compared with 0.840 ± 0.007 for the baseline RT-DETR. The difference between the two groups was checked with an independent two-sample, two-sided t-test at a preset significance level of α = 0.01. The resulting t-value was 4.79 with 8 degrees of freedom, and the p-value was less than 0.01, so the difference is statistically significant. This indicates that the performance improvement of PDGV-DETR is unlikely to be due to random fluctuations and instead reflects a systematic and stable improvement brought about by the architecture optimization.
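The significance check can be approximated from the reported summary statistics alone. The sketch below is illustrative rather than the authors' script: it assumes a pooled-variance (Student) t-test with five runs per model, and the function name `pooled_t_from_summary` is hypothetical. Because it uses the rounded means and standard deviations, the statistic it produces is close to, but not identical to, the reported 4.79, which was presumably computed from the unrounded per-seed results.

```python
import math

def pooled_t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Pooled-variance two-sample t statistic and degrees of freedom
    computed from summary statistics (mean, sample SD, sample size)."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_err, df

# Rounded mAP50 values as reported in the text (5 seeds per model).
t_stat, df = pooled_t_from_summary(0.858, 0.004, 5, 0.840, 0.007, 5)

# Two-sided critical value of Student's t for alpha = 0.01 and df = 8.
T_CRIT_ALPHA_001_DF8 = 3.355
significant = abs(t_stat) > T_CRIT_ALPHA_001_DF8

print(f"t = {t_stat:.2f}, df = {df}, significant at 0.01: {significant}")
```

Under these assumptions the statistic exceeds the critical value by a comfortable margin, which agrees with the reported conclusion that p < 0.01.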

The core task of this study is the detection and localization of safety-related objects in a single-frame static image, without continuous temporal information. A certain rate of false positives is therefore unavoidable. In sports scenes such as basketball, wrestling, and judo, the confrontational body postures of people closely resemble the visual features of conflict scenes, which easily leads to false detection of personnel objects and misjudgment of scene attributes. Friendly embraces, playful interactions, and similar non-violent physical contact share static visual features with physical conflicts and can also be misjudged. Items such as children’s toy guns, replica guns, and craft knives, which closely resemble prohibited weapons in appearance, are easily misjudged as gun or knife objects.

The aforementioned false positive rate is an inherent limitation of single-frame static image object detection technology and cannot be completely eliminated through feature optimization. It requires a secondary verification in combination with subsequent temporal behavior recognition models. This is the core reason why this paper clearly limits the research scope to object-level detection and positioning, and does not involve behavior-level classification.

4.5. Ablation Experiment

4.5.1. Discussion on Module Precision

To verify the gain effect of each innovative module in the security-related object detection and positioning tasks in the security scenarios, this study constructed a stepped ablation experiment: using RT-DETR as the benchmark, the bidirectional hybrid feature pyramid network (DWH-FPN), the dynamic hierarchical channel interactive convolution module (BasicPCock), and the global semantic weaving and elastic feature alignment network (GSWFN) were successively introduced, and their combination schemes were tested. Eventually, the complete framework of PDGV-DETR was formed, systematically verifying the effectiveness of each module in the security-related object detection and positioning tasks.

The dataset used was the Violence-Image-Dataset test set. In the table, the correct and incorrect marks indicate whether each improvement was used. In the subsequent illustrations, RT-DETR + DWH-FPN denotes the baseline with the DWH-FPN module added; RT-DETR + BasicPCock and RT-DETR + GSWFN denote the baseline with the BasicPCock convolution module and the GSWFN feature fusion network added, respectively; RT-DETR + DWH-FPN + BasicPCock denotes the baseline with both DWH-FPN and BasicPCock added; and PDGV-DETR is the final model with all three modules. The ablation experiment results are shown in Table 3.

The experimental results show that the mAP50 of the proposed PDGV-DETR reached 85.9%, which is 1.2 percentage points higher than that of the baseline RT-DETR. The mAP50-95 reached 64.3%, also 1.2 percentage points higher than the baseline. Precision and recall increased by 1.6 and 2.0 percentage points, respectively. In the “fighting” class in particular, the mAP50 increased by 1.5 percentage points, demonstrating the effectiveness and practicality of each improvement module for security-related object detection and positioning in security scenarios. The following are the visualized graphs of the evaluation results of each experiment.

Figure 5 shows the comparison of the mAP@0.5 convergence curves of the RT-DETR baseline model and the PDGV-DETR model during the training process from 0 to 150 rounds. In the initial training stage (Epoch 0–50), PDGV-DETR demonstrated superior feature learning efficiency and cold-start capability. At the 1st training round, PDGV-DETR’s mAP@0.5 reached 0.053, which is 88.7% higher than the 0.028 of the RT-DETR baseline, and its initial fitting speed was significantly ahead; by the 50th training round, PDGV-DETR’s mAP@0.5 reached 0.754, while RT-DETR was only 0.746. This verifies that the feature interaction module designed in this paper can accelerate the model’s feature capture of personnel and weapon objects in security scenarios, effectively shortening the initial convergence cycle.

In the middle and later stages of training (Epoch 50–150), the convergence characteristics of the two models show significant differentiation. PDGV-DETR exhibits high training stability. During the plateau period from Epoch 100 to 150, the curve of the RT-DETR baseline shows obvious oscillations, with the maximum fluctuation range of mAP@0.5 reaching 0.023 (fluctuation range 0.791–0.814), and reaching a local peak of 0.814 in a high-variance form. In contrast, the curve of PDGV-DETR is smooth throughout, with mAP@0.5 steadily maintained at around 0.802, with a fluctuation range of only 0.004 (fluctuation range 0.798–0.802), and the fluctuation rate is significantly reduced by 83.5% compared to the baseline. It is particularly worth noting that although the baseline model obtained slightly higher local values on the training validation curve due to intense oscillations, this high variance characteristic often indicates that the model has overfitted the training distribution, resulting in a loss of its generalization ability; while PDGV-DETR’s stable convergence pattern proves that the multi-module collaborative optimization in this paper effectively filters out the gradient noise interference caused by complex backgrounds and occluded objects. This outstanding anti-overfitting ability and training stability clearly explain why PDGV-DETR ultimately achieved better actual detection accuracy (0.859 compared to 0.847) on the test set (as shown in Table 1), achieving a dual improvement in training stability and real-world generalization.

Figure 6 shows the Precision–Recall curves of the experiment, with the horizontal axis representing the recall of the prediction results and the vertical axis representing their precision. The Precision–Recall curve shows how the precision of the model changes across recall levels. In the figure, “fighting” denotes individuals engaged in violent behavior and “normal” denotes individuals engaged in regular, non-violent behavior. These are the core detection categories of the model, corresponding to the violent/non-violent behavior attributes of the individuals.

Figure 6b–d are the Precision–Recall (P-R) curves of the models that each integrate a single innovative module. Compared with the baseline in Figure 6a, these curves shift upward to varying degrees, indicating that precision is improved at the same recall, which directly verifies the effectiveness of each individual module in optimizing detection performance.

Figure 6e shows the P-R curve of the model integrating two modules. This curve is closer to the upper right corner of the coordinate system and has better overall performance than the single-module model. The optimization of the curve shape confirms the synergy effect of the multi-module combination: by integrating the advantages of different modules (such as feature fusion or incomplete feature repair), the model achieves a better balance between precision rate and recall rate, further improving the comprehensive detection performance.

Figure 6f presents the performance of the proposed PDGV-DETR model in this paper. Its mAP50 reaches 0.859, which is the highest among all comparison models. This result fully validates the effectiveness of the closed-loop design of the three core modules of the PDGV-DETR model—by integrating multi-scale feature fusion, incomplete feature repair, and background interference filtering functions, the model ultimately achieves the optimal detection performance.

Figure 7 shows F1–Confidence curves. An F1–Confidence curve is typically used to evaluate the performance of object detection models at different confidence threshold values. It depicts the curve relationship between the F1 score and the confidence threshold value. Both curves show an “ascending first and then descending” arch shape, which conforms to the inherent pattern of F1–Confidence curves: at low confidence levels, the predictions are mixed (with more false positives), resulting in a lower F1 score; as the confidence level increases, the proportion of effective predictions increases, and the F1 score rises; and at very high confidence levels, the screening is too strict (with more missed detections), causing the F1 score to drop. This indicates that the prediction logic of the two models is reasonable and can filter out valid results based on the confidence threshold value.
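The arch shape described above can be reproduced with a toy example. The following is a minimal sketch, not the evaluation code used in the paper: it assumes each detection has already been labeled as a true or false positive (in practice this comes from IoU matching against ground truth, per class), and the function name `f1_confidence_curve` and the sample data are hypothetical.

```python
def f1_confidence_curve(scores, is_true_positive, num_ground_truth, thresholds):
    """Return (threshold, F1) pairs, keeping detections with score >= threshold."""
    curve = []
    for thr in thresholds:
        kept = [tp for s, tp in zip(scores, is_true_positive) if s >= thr]
        tp = sum(kept)                 # correct detections that survive the threshold
        fp = len(kept) - tp            # surviving detections that are wrong
        fn = num_ground_truth - tp     # ground-truth objects left undetected
        f1 = 2 * tp / (2 * tp + fp + fn)
        curve.append((thr, f1))
    return curve

# Toy detections: confidence score and whether each one matches a ground-truth box.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
is_true_positive = [1, 1, 1, 0, 1, 0, 0, 0]

curve = f1_confidence_curve(
    scores, is_true_positive, num_ground_truth=5,
    thresholds=[0.10, 0.50, 0.75, 0.92],
)
best_threshold, best_f1 = max(curve, key=lambda pair: pair[1])

for thr, f1 in curve:
    print(f"threshold={thr:.2f}  F1={f1:.3f}")
```

In this toy data the F1 score is low at the loosest threshold because of false positives, peaks at an intermediate threshold, and falls again at the strictest threshold because of missed detections, which is the same pattern the curves in Figure 7 follow.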

By comparison, it can be seen that the F1 peak of the PDGV-DETR in this study is 0.83 (higher than 0.81 of RT-DETR), and the corresponding confidence threshold value at the peak is 0.545 (lower than 0.576 of RT-DETR). This result indicates that PDGV-DETR can achieve higher detection accuracy at a lower confidence threshold value, demonstrating a better prediction confidence level, and also has an advantage in balancing accuracy and inference efficiency, making it more suitable for the dual requirements of “accuracy” and “real-time performance” in security scenarios.

Figure 8 compares the detection box results of RT-DETR and PDGV-DETR. Figure 8a shows the detection output of RT-DETR (including bounding boxes), and Figure 8b shows the detection output of PDGV-DETR.

In street scenes, RT-DETR exhibits obvious positioning deviations and classification defects: the bounding boxes of people are offset, and some scene elements are missed. PDGV-DETR relies on the elastic alignment mechanism of the SFC_G2 module to achieve pixel-level alignment of bounding boxes, and adds 2 precise detection boxes for occluded objects, without missed detections and with accurate positioning.

In indoor scenes, RT-DETR mistakenly labels seated individuals engaged in regular, non-violent behavior as individuals engaged in violent conflict, and its bounding boxes for people are offset. PDGV-DETR avoids this behavior attribute misclassification and correctly identifies the individuals as non-violent. Its detection coverage is complete, and its bounding boxes align closely with the objects.

To quantitatively analyze object classification confusion in complex security scenarios and verify the effect of the GSWFN module on misjudgments in complex backgrounds, this study generated confusion matrices for the baseline RT-DETR and the PDGV-DETR model based on the Violence-Image-Dataset test set. The results are shown in Figure 9 and Figure 10.

Comparing Figure 9 and Figure 10 shows that PDGV-DETR reduces both false positive detections and missed detections. In single-frame static image detection, the lack of temporal context means that the visual features of different behavior attributes of people are highly similar. The baseline RT-DETR model misclassified “fighting” (violent behavior personnel) as “normal” (non-violent regular behavior personnel) with a normalized proportion of 0.05 (16 instances), while PDGV-DETR reduced this proportion to 0.03 (11 instances). At the same time, the number of correctly predicted “fighting” instances increased from 319 to 321. This suggests that PDGV-DETR is more sensitive to violent features and less likely to miss true conflict events.

In terms of background interference suppression capability, the GSWFN module significantly enhances the model’s robustness in object recognition in chaotic security scenarios and effectively reduces background false alarms. Observing the original numerical matrix, it can be seen that the baseline RT-DETR model mistakenly detected “background” as “fighting” 37 times (normalized proportion 0.08), while PDGV-DETR significantly reduced this misdetected number to 21 times (normalized proportion 0.05). For the background false alarms of the “normal” category, the baseline model misdetected 430 times, while the model in this study reduced it to 374 times. Overall, the total number of background false positive bounding boxes dropped from 467 in the baseline to 395 in our model. This firmly verifies the GSWFN module’s ability to suppress complex background interference and enhances the model’s object recognition robustness in chaotic security scenarios. In conclusion, PDGV-DETR significantly outperforms RT-DETR in bounding box positioning accuracy and category recognition accuracy of security-related objects. The experimental results fully validate the three core structural adaptations integrated in this study, which through the technical loop of “fragmented feature restoration, multi-scale semantic binding, and global context completion”, systematically improves the model’s anti-interference ability in real complex scenarios and better meets the core requirements of accurate positioning of weapon and personnel objects in security scenarios.
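The background false alarm totals quoted above follow directly from the per-class counts. The short sketch below only re-adds the numbers stated in the text; the dictionary layout and variable names are illustrative, and the relative reduction is a derived figure that the paper does not report.

```python
# Background regions wrongly detected as each class, as quoted in the text.
background_false_positives = {
    "RT-DETR": {"fighting": 37, "normal": 430},
    "PDGV-DETR": {"fighting": 21, "normal": 374},
}

# Total background false positives per model.
totals = {
    model: sum(per_class.values())
    for model, per_class in background_false_positives.items()
}

# Absolute and relative reduction achieved by PDGV-DETR (derived, not reported).
absolute_reduction = totals["RT-DETR"] - totals["PDGV-DETR"]
relative_reduction = absolute_reduction / totals["RT-DETR"]

print(totals)
print(f"reduction: {absolute_reduction} boxes ({relative_reduction:.1%})")
```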

4.5.2. Cost–Benefit Analysis of Computational Overhead and Demonstration of Deployment Adaptability

This section addresses the core issues raised by the reviewers regarding the cost–benefit trade-offs of architecture complexity, computational overhead, and accuracy improvement, as well as the balance between real-time performance and accuracy. Based on the experimental data, it systematically verifies the rationality and practical deployment value of the combination of the three innovative modules from four dimensions: quantitative analysis of deployment core indicators, visualization of accuracy-delay trade-off, marginal benefit demonstration, and engineering implementation adaptability.

Quantitative Analysis of Core Deployment Indicators for the Ablation Models

To clarify the independent effects and collaborative effects of each innovation module on the computational cost, inference efficiency and detection accuracy of the model, this section describes an ablation experiment based on the experimental scheme, and statistically calculates the core deployment indicators of each model. The results are shown in Table 4. All indicators were tested under the same experimental environment.

From the quantitative data in Table 4, it can be seen that each innovation module has a significantly differentiated impact on computing resources and detection performance, forming a good technical loop. After introducing the BasicPCock module, the model demonstrated a significant lightweight advantage, with its parameter quantity dropping from the baseline RT-DETR’s 19.88 M to 14.35 M, a reduction of 27.8%. At the same time, GFLOPs dropped from 56.9 to 49.9, a decrease of 12.3%, and the single-frame inference time was reduced from 6.9 ms to 6.1 ms. The lightweight design of this module effectively mitigated the increase in computational overhead brought by the DWH-FPN and GSWFN modules, resulting in the final complete PDGV-DETR model having only 16.34 M parameters, a 17.8% decrease from the baseline, significantly alleviating the memory occupancy pressure on edge devices.

Although the introduction of the DWH-FPN and GSWFN modules increased GFLOPs to varying degrees, their core value lies in solving engineering pain points in security scenarios: DWH-FPN enhances the synchronous detection of multi-scale objects, while GSWFN significantly reduces the false detection rate in complex backgrounds. The final PDGV-DETR model achieved the highest mAP50 of 0.859 on the Violence-Image-Dataset test set, 1.2 percentage points above the baseline. In terms of computational overhead, PDGV-DETR’s GFLOPs were 59.5, a slight increase of 4.6% over the baseline, and the single-frame full-process inference time was 8.0 ms, only 0.6 ms more than the baseline’s 7.4 ms. The improved recall and reduced false detection rate were thus obtained at a very low additional computational cost, indicating that the model converts computation into accuracy efficiently.

Regarding the balance between real-time performance and accuracy, mainstream security surveillance processes video streams at 30 FPS, and industry thresholds typically require a single-frame full-process inference time of ≤30 ms (slightly below the theoretical upper limit of 33.3 ms). PDGV-DETR’s 8.0 ms is far below this threshold, accounting for only 26.7% of it. PDGV-DETR therefore retains full real-time performance while further improving detection accuracy and scene robustness, meeting the core requirements of security surveillance scenarios. It sits in the favorable region of the accuracy–delay trade-off curve, achieving maximum accuracy within a controllable delay range, and its security and engineering value outweigh the marginal increase in computational overhead, making it a better solution for actual security deployment scenarios.
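The percentages in this subsection can be rechecked from the quoted values. The sketch below is only a recomputation under the figures stated in the text; the helper name `relative_change_pct` and the dictionary layout are illustrative, and Table 4 itself is not reproduced here.

```python
def relative_change_pct(new, base):
    """Signed relative change of `new` with respect to `base`, in percent."""
    return (new - base) / base * 100.0

# Values quoted in the text.
params_m = {"RT-DETR": 19.88, "RT-DETR + BasicPCock": 14.35, "PDGV-DETR": 16.34}
gflops = {"RT-DETR": 56.9, "RT-DETR + BasicPCock": 49.9, "PDGV-DETR": 59.5}
full_process_ms = {"RT-DETR": 7.4, "PDGV-DETR": 8.0}

param_cut_basicpcock = relative_change_pct(params_m["RT-DETR + BasicPCock"], params_m["RT-DETR"])
param_cut_full = relative_change_pct(params_m["PDGV-DETR"], params_m["RT-DETR"])
gflops_cut_basicpcock = relative_change_pct(gflops["RT-DETR + BasicPCock"], gflops["RT-DETR"])
gflops_rise_full = relative_change_pct(gflops["PDGV-DETR"], gflops["RT-DETR"])

# Real-time budget at 30 FPS and the stricter 30 ms threshold used in the text.
frame_budget_ms = 1000.0 / 30
threshold_ms = 30.0
latency_increase_ms = full_process_ms["PDGV-DETR"] - full_process_ms["RT-DETR"]
threshold_share_pct = full_process_ms["PDGV-DETR"] / threshold_ms * 100.0

print(f"parameters: {param_cut_basicpcock:.1f}% (BasicPCock), {param_cut_full:.1f}% (PDGV-DETR)")
print(f"GFLOPs: {gflops_cut_basicpcock:.1f}% (BasicPCock), {gflops_rise_full:+.1f}% (PDGV-DETR)")
print(f"latency: +{latency_increase_ms:.1f} ms, {threshold_share_pct:.1f}% of the {threshold_ms:.0f} ms threshold")
```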

Visualization of Accuracy–Delay Trade-Off and Marginal Benefit Analysis

To visually verify the comprehensive balance ability of the model in terms of “real-time performance—accuracy”, this section plots the scatter plot of the single-frame inference time and mAP50 performance of each model in the ablation experiment (Figure 11). The x-axis represents the single-frame pure inference time (unit: ms), and the y-axis represents the mAP50 detection accuracy of the model on the Violence-Image-Dataset test set. The upper right corner of the coordinate system indicates the optimal performance range of “high accuracy—low latency”. The closer the scatter points are to this range, the stronger the model’s comprehensive adaptability in the real-time security detection scenario.

From the scattered distribution, it can be seen that the baseline RT-DETR is located in the lower left corner of the coordinate system, being the benchmark point with the lowest inference time and the lowest accuracy among all models; the single-module improved models are all distributed in the upper right corner of the baseline, achieving a linear trade-off between accuracy improvement and a slight increase in latency; the dual-module combined model (RT-DETR + DWH-FPN + BasicPCock) further approaches the optimal range; and PDGV-DETR is located at the top right corner of all scattered points, being the best-performing solution among all ablation models, achieving maximum accuracy within a controllable range of latency.

During the evolution from single-module to multi-module, the “precision/latency ratio” (the increase in precision per unit latency) of the models has always maintained high returns: the BasicPCock module achieved negative latency growth + precision improvement, with the marginal benefit being the highest among all modules; the dual-module combination of DWH-FPN and BasicPCock increased mAP50 by 0.9% for every 1 ms increase in latency; and PDGV-DETR introduced the GSWFN module on the basis of the dual-module, and for every 0.1 ms increase in latency, mAP50 increased by 0.3%, without showing a decrease in marginal benefit, verifying the synergistic gain effect of the three-module technical loop.

The detection of weapons and personnel in public security scenarios belongs to safety-critical applications rather than general entertainment-oriented computer vision tasks. The value of accuracy improvement cannot be measured solely by the numerical increase in mAP; the core lies in the reduction of false detection rates, which brings about risk avoidance in security.

The three-module combination proposed in this paper increases computational overhead by only 4.6% and fully meets the real-time latency requirements. It not only achieved a stable 1.2 percentage point increase in mAP but, more importantly, addressed the four core pain points of weapon and personnel detection in security scenarios, significantly improving the generalization, robustness, and engineering adaptability of the model. Its security and engineering value far outweigh the marginal increase in computational overhead, making it the optimal solution for actual security deployment scenarios. This study also states clearly that the current model has not yet been measured or adapted on low-power edge devices; verification of edge deployment performance is left to subsequent research, and no claims are made about it here.

4.5.3. Robustness Supplementary Verification Experiment

For the degradation input problem in real monitoring scenarios, we conducted a special robustness verification experiment to simulate the common input degradation of real hardware on the test set. We compared the performance degradation of RT-DETR and PDGV-DETR. The specific contents included:

Gaussian noise interference: Add Gaussian noise with σ = 5, σ = 10, and σ = 15 respectively.

Color space conversion: Convert the RGB image to a single-channel grayscale image.

Resolution degradation: Downsample the input image from 640 × 640 to 320 × 320 and 160 × 160.

The specific results are shown in Table 5.
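The three degradations can be sketched with NumPy as follows. This is an illustrative reconstruction, not the authors' preprocessing code. Several details are assumptions that the text does not specify: σ is taken to be on the 0–255 pixel scale, grayscale conversion uses the ITU-R BT.601 luma weights, downsampling uses block averaging, and a random array stands in for a real test image. How the grayscale image is fed back to a three-channel detector is also not stated, so that step is omitted.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise (sigma on the 0-255 scale, an assumption)."""
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)

def to_grayscale(img):
    """Convert an HxWx3 RGB image to HxW grayscale with BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = img.astype(np.float32) @ weights
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

def downsample(img, factor):
    """Downsample an HxWxC image by an integer factor using block averaging."""
    h, w, c = img.shape
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by the factor")
    blocks = img.reshape(h // factor, factor, w // factor, factor, c)
    return np.rint(blocks.astype(np.float32).mean(axis=(1, 3))).astype(np.uint8)

rng = np.random.default_rng(0)
# Stand-in for a 640 x 640 RGB test image.
image = rng.integers(0, 256, size=(640, 640, 3), dtype=np.uint8)

noisy = {sigma: add_gaussian_noise(image, sigma, rng) for sigma in (5, 10, 15)}
gray = to_grayscale(image)
half = downsample(image, 2)      # 640 x 640 -> 320 x 320
quarter = downsample(image, 4)   # 640 x 640 -> 160 x 160
```

Each degraded copy of the test set would then be passed through both models to obtain the kind of comparison reported in Table 5.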

In all simulated degradation input scenarios, the detection accuracy of PDGV-DETR was consistently superior to that of the baseline RT-DETR model. Gaussian noise directly destroys the edges and texture key features of the object, easily causing missed detections of small weapon objects and incorrect judgments of human features. Experimental results show that as the noise intensity increases, the performance advantage of PDGV-DETR continues to expand: at a mild noise level of σ = 5, its mAP50 is 0.847, 1.2% higher than RT-DETR; while with a strong noise interference of σ = 15, PDGV-DETR’s mAP50 still remains at 0.708, 3.3% higher than the baseline model (0.675). This advantage mainly benefits from the adaptive repair ability of the BasicPCock module for incomplete features, combined with the global context association of the GSWFN module, achieving “local feature damage, global semantic completion”, significantly alleviating the sharp decline in accuracy in the presence of strong interference.

Single-channel grayscale images are the common input form for legacy analog equipment and night infrared monitoring, and general models often degrade significantly when color features are missing. Table 5 indicates that under grayscale input, PDGV-DETR’s mAP50 remains at 0.842, 0.7 percentage points higher than RT-DETR, and the drop from its performance under high-quality RGB input (0.859) is small, demonstrating low dependence on color information. This suggests that the DWH-FPN module binds the low-level structure of the object to high-level semantics through deep interaction of high- and low-level features, enabling the model to adapt directly to legacy grayscale monitoring equipment.

In addition, long-distance monitoring or video stream compression often leads to a significant reduction in the effective pixels of the object. When the image is downsampled to 320 × 320, PDGV-DETR achieves an mAP50 of 0.856, 1.4% higher than the baseline; even in the extreme scenario of 160 × 160 ultra-low resolution, its mAP50 still reaches 0.832 (0.6% higher than the baseline). This proves that the fine convolution design of some channels in the BasicPCock module and the multi-scale fusion mechanism of DWH-FPN can accurately capture key features even in severely lacking pixel conditions.

Robustness verification experiments show that PDGV-DETR not only achieved significant accuracy improvements in high-quality ideal datasets, but also systematically improved its anti-interference ability in harsh conditions through the technical loop of “incomplete feature repair, multi-scale semantic binding, and global context completion”. Compared to the baseline model, it shows a lower performance attenuation in various extreme visual degradation scenarios, providing reliable technical support for the large-scale and stable deployment of various legacy and incremental monitoring hardware in public security scenarios.

4.6. Architectural Distinctiveness and Complexity Discussion

To explicitly articulate the architectural advancements of PDGV-DETR and address the distinctions from existing prior models, this section contrasts the adapted components (BasicPCock, DWH-FPN, and GSWFN) directly against their original vanilla counterparts. The structural differentiation and complexity impacts are summarized in Table 6.

To visually demonstrate the necessity of these structural adjustments, Figure 12 presents a direct quantitative comparison of these modules.

As shown in Figure 12a, directly deploying the original HS-FPN to our security scenario would result in a significant drop in accuracy (mAP dropping from 0.847 to 0.833), which proves that the one-way top-down semantic transfer is insufficient to handle multi-scale threat detection. Instead, our proposed DWH-FPN achieved the highest accuracy (0.854) with only a slight increase in computational cost, verifying the effectiveness of the bidirectional hybrid reconstruction. Additionally, Figure 12b visualizes the computational complexity optimization implemented by BasicPCock. We did not blindly replace all convolutional layers but conducted a structural depth analysis. The chart reveals that “two-layer replacement” is a clear “sweet spot”: at this setting, the network reached a mAP peak of 0.853, while significantly reducing the computational cost to 49.9 GFLOPs. This confirms that BasicPCock effectively eliminates spatial redundancy without sacrificing key feature representation, constituting the lightweight foundation of the PDGV-DETR architecture.

5. Conclusions

To enhance the accuracy, real-time performance, and generalization of the detection and positioning of security-related objects such as weapons and personnel in static surveillance images in complex security scenarios, this study proposes the PDGV-DETR model for the detection of weapons and personnel objects in security sites. Through the collaborative optimization of three core modules, it addresses the core pain points faced by traditional general object detection models in the security-related object detection task of static security images, such as the difficulty in capturing multi-scale objects, the easy omission of incomplete objects, and the misjudgment in complex backgrounds. This study is limited to the task of object-level detection and positioning in static images and does not involve the identification of violent behaviors based on temporal information. These two tasks belong to different technical tasks and have no horizontal performance comparability. By constructing a technical loop of “multi-scale fusion → incomplete feature restoration → background interference filtering”, the detection performance has achieved systematic enhancements.

The model first enhances the bidirectional interaction of high and low-level features through the Dual Hybrid Feature Pyramid Network (DWH-FPN), deeply binding the detailed textures of small-scale weapons and the semantic of dangerous categories; secondly, it introduces the Dynamic Hierarchical Channel Interactive Convolution Module (BasicPCock), which significantly improves the feature compensation ability for occluded and incomplete objects while effectively reducing the computational complexity; finally, it combines the Global Semantic Weaving and Elastic Feature Alignment Network (GSWFN), significantly reducing the probability of false detection and omission of objects in crowded and complex environments.

The systematic experiments conducted based on four typical security datasets fully validate the superiority of PDGV-DETR. In the two tasks of conflict scene person detection in the Violence-Image-Dataset test set and regular person detection in the People dataset, the peak mAP50 of this model reached 85.9% and 82.4% respectively. In the knife detection task of the dedicated weapon dataset, its mAP50 even reached 93.0%. The statistical verification based on multiple random seeds consistently confirmed the significant improvement in model performance. The result of PDGV-DETR was 0.858 ± 0.004, while the baseline model was 0.840 ± 0.007, verifying the reliability and practical effectiveness of the current architecture improvement. Compared with the YOLO series and other mainstream general detection models, PDGV-DETR not only achieves a significant accuracy improvement under a slight increase in computational cost, but also demonstrates strong scene robustness and operational stability in various complex hardware and degraded input conditions, effectively fulfilling the dual strict requirements of accuracy and real-time performance for security applications.

This study provides a reproducible and implementable technical framework for the detection and positioning of security-related objects such as weapons and personnel in static surveillance images in complex security scenarios. Its core modules (DWH-FPN, BasicPCock, GSWFN) can be transferred to other object detection fields such as industrial defect detection and medical image analysis, and have broad cross-domain application potential.

The proposed PDGV-DETR model can provide security systems with precise detection and positioning of weapons and personnel, and can serve as object-level pre-processing for early warning of violent risks. Future research will proceed in three directions. First, we will further optimize the robustness of security-related object detection under extreme occlusion and low-light conditions. Second, to enhance practical engineering value, we will design a systematic quantization and deployment roadmap for low-power edge devices (such as NVIDIA Jetson Nano and Raspberry Pi). This roadmap will first apply post-training quantization (PTQ) to convert model weights into INT8/FP16 precision and compress the model size. We will then introduce quantization-aware training (QAT) to recover any accuracy loss caused by low-precision operations, and we plan to use the TensorRT framework to accelerate inference and to quantitatively verify real-time performance metrics such as frame rate (FPS), power consumption, and inference latency on these devices. Third, we will extend the temporal modeling capability of the model by integrating standard violent behavior detection video datasets such as RWF-2000 and Hockey Fight, completing the cross-modal adaptation and verification of the model, and achieving an end-to-end pipeline from security-related object detection to violent behavior recognition for violent risk early warning.
