An Intelligent Obstacle Detection Method for Rail Transit Scenarios
Zhao Sheng, Tianyang Liu, Wei Shangguan, Yijing Wang, Yige Wang, Zhiyu He

TL;DR
This paper introduces ACX-YOLOv8, an improved object detection method for identifying obstacles on railway tracks, offering better accuracy and efficiency for real-time monitoring.
Contribution
The novel ACX-YOLOv8 integrates SCAM, CDConv, and an X6 detection head to enhance obstacle detection in railway environments.
Findings
ACX-YOLOv8 achieves 87.1% mAP50 on the test dataset, a 2.7% improvement over the baseline YOLOv8.
The model has 4.85 million parameters and maintains lightweight performance while ensuring detection precision.
It shows a 1.8% mAP50 improvement on the PASCAL VOC dataset, demonstrating strong generalization ability.
Abstract
Traditional signal equipment is incapable of real-time monitoring of foreign objects intruding into track zones. To effectively ensure the operational safety of trains, this paper presents an intelligent obstacle detection approach of visual sensing for railway track regions based on YOLOv8, named ACX-YOLOv8. Built upon the baseline YOLOv8 framework, the proposed method first incorporates the spatial coordinate attention mechanism (SCAM) to enhance the model’s ability to capture long-range dependencies and local fine-grained details, thereby improving its perceptual capacity and feature representation performance. Subsequently, the cascaded dilated convolution (CDConv) module is integrated to effectively extract multi-scale image features, strengthening the model’s capability to identify foreign objects in complex railway environments. Finally, an X6 decoupled detection head is devised…
Funding
- Science and Technology Research Project of China Railway
- State Key Laboratory of Advanced Rail Autonomous Operation
- Youth Research Project of the Signal and Communication Research Institute, China Academy of Railway Sciences Corporation Limited
Taxonomy
Topics: Advanced Neural Network Applications · Railway Engineering and Dynamics · Railway Systems and Energy Efficiency
1. Introduction
Railway transportation is one of the modern means of transport. Characterized by high accuracy, strong continuity, fast speed, and massive traffic volume, it has gradually occupied an important position in China’s transportation industry. As a critical national infrastructure and the backbone of the comprehensive transportation system, railways represent an effective pathway for achieving intensive development and enhancing resource utilization efficiency. The “14th Five-Year” Railway Development Plan clarifies that China’s railway sector is currently in a critical stage of improvement and efficiency enhancement [1]. However, with the rapid expansion of the railway network, the challenges to railway operational safety are increasing. The intrusion of foreign objects into train operation sections is one of the primary causes of railway safety incidents [2]. In recent years, numerous foreign object intrusion accidents have occurred both domestically and internationally, resulting in severe casualties and economic losses. For instance, in 2015, a collision between a train and a car in northwest France killed three people; in 2018, a collision involving two trains in Germany left two people dead; and in 2021, Train K596 in mainland China struck construction workers, causing nine fatalities. These accidents and related data indicate that traditional safety protection detection along railway lines struggles to cope with sudden emergencies. Although comprehensive railway video surveillance systems can collect real-time video footage along the lines, due to limitations in analysis methods, information processing is often lagging, with data typically extracted only after an incident. Therefore, integrating comprehensive railway video surveillance systems with image processing technologies to explore real-time foreign object intrusion detection algorithms for proactive prevention is particularly crucial.
The most primitive method for foreign object intrusion detection is manual inspection [3]. This approach suffers from numerous drawbacks, including low efficiency, high cost, and significant susceptibility to environmental conditions and personnel status. It is difficult to achieve full coverage along railway lines at all times, and inspection work becomes particularly challenging under complex environments such as severe weather. However, with the continuous development of communication and automation technologies, contact-based detection technologies have been widely adopted. Facing threats from natural disasters, the Japanese Shinkansen high-speed railway introduced an automated management system incorporating optical fiber sensors, which assesses potential danger levels by monitoring optical signal attenuation [4]. Italian scholar Angelo Catalano et al. [5] deployed fiber Bragg grating (FBG) sensors in critical areas to collect environmental acoustic waves. By analyzing the collected acoustic signals, they could distinguish between abnormal sounds caused by foreign object intrusion and background noise, thereby determining the presence of an intrusion. Sun Rui et al. [6] investigated the temperature and strain sensing characteristics of multiple reflection peaks in Bragg gratings, which were verified through experimental simulations. By fusing wavelength division multiplexing (WDM) with space division multiplexing (SDM) technologies, they enhanced the multiplexing performance and expanded the range of FBG sensors, achieving efficient and accurate distributed monitoring. D. Sinha et al. [7] installed accelerometers on railway tracks at specific locations to precisely measure track vibrations. 
Using a novel Bayesian analysis based on the Monte Carlo method to extract obstacle drop signals, Sinha et al. [7] filtered out locomotive motion noise and accurately identified obstacle drop events, thereby enabling a more precise evaluation of railway track sensing technology performance. Liu et al. [8] developed an intelligent foreign object intrusion monitoring system for railway station tracks. This system collects and transmits data via sensors and processes the information using intelligent algorithms to enhance the safety, reliability, and economy of high-speed railway operations. However, such contact-based detection technologies share a common characteristic: they rely on sensors to detect minute changes in protective fences. Once a foreign object intrudes onto the track, the sensors rapidly collect relevant signals and transmit them to the control center via a signal transmission system. After analyzing and processing the received signals, the control center initiates corresponding safety measures. Nevertheless, these track obstacle detection systems exhibit significant instability. The sensitivity of the sensors and the receivers is highly susceptible to interference from environmental factors, leading to a high false alarm rate, which poses a potential threat to railway operational safety.
In recent years, multimodal detection technologies (e.g., RGB–thermal fusion) have demonstrated outstanding generalization capabilities in complex environments for object detection, providing a new research direction for railway foreign object detection. However, these technologies face challenges such as high hardware costs and complex data fusion, which have hindered their practical engineering implementation. In contrast, machine vision technology has been widely applied in various fields such as quality control, automated production, and safety monitoring due to its advantages of high efficiency, precision, consistency, and reliability. Based on their development timeline, machine vision technologies can be categorized into traditional image processing methods and deep learning-based detection methods. Traditional image processing methods primarily rely on manually designed feature extraction algorithms to perform preprocessing, feature extraction, and classification recognition on images of foreign object intrusions. Guo Baoqing et al. [9] proposed an obstacle detection algorithm based on image processing. This algorithm utilizes one-dimensional gray projection combined with Gaussian filtering for rapid de-shaking. Furthermore, they proposed a background update algorithm based on the statistical distribution of foreground objects. By defining a target dispersion index to optimize the row and column projection order, this method solves the ghosting problem inherent in traditional background update algorithms. Zhang Lihua [10] proposed a dehazing algorithm based on dark channel prior and bottom-hat transformation to improve image quality and reduce processing time. An adaptive threshold Canny operator combined with connected component analysis was employed to detect track edges and eliminate irrelevant ones. Additionally, the ViBe and CamShift algorithms were fused to achieve moving target tracking and accurately determine track intrusion behavior. 
Although the aforementioned methods have achieved certain results in specific scenarios, their performance is often limited by the capability of manually designed features. Moreover, they are sensitive to noise in images and possess poor generalization ability. Furthermore, they cannot determine the type of foreign object, which limits their application in real-world foreign object detection.
With the rapid development of artificial intelligence technology, deep learning (DL) [11], as an advanced machine learning paradigm, has achieved significant progress in the field of computer vision. Unlike traditional machine vision, which relies on manually designed feature extractors, deep learning methods utilize the hierarchical structure of neural networks and training on large-scale datasets to automatically learn and extract high-level features from raw images [12], eliminating the need for manually designed feature extraction algorithms. This approach demonstrates stronger adaptability and higher detection accuracy, thereby achieving significant breakthroughs in tasks such as image classification [13], object detection [14], and image segmentation [15]. In particular, the application of object detection technology has enabled the automatic identification and classification of railway track foreign objects, greatly improving detection efficiency and accuracy.
Currently, object detection technologies in the field of deep learning are primarily categorized into two classes: two-stage algorithms and one-stage algorithms. Two-stage algorithms subdivide the object detection task into two phases: region proposal generation and region proposal classification, making them particularly suitable for scenarios requiring high detection accuracy. Typical two-stage detection algorithms include R-CNN (regions with convolutional neural networks) [16], Fast R-CNN [17], Faster R-CNN [18], and Mask R-CNN [19]. These algorithms have been extensively researched in the field of railway foreign object detection. For instance, addressing the shortcomings of traditional railway foreign object detection algorithms (such as low recognition accuracy, ambiguous classification, and susceptibility to external environmental influences), Tao Huiqing [20] proposed replacing fully connected layers with global average pooling layers to reduce the number of parameters while increasing the number of anchors to improve the precision of target region proposals. Sun et al. [21] and He et al. [22] enhanced the speed and accuracy of railway intrusion detection by utilizing a fault detection pyramid and an improved region-based convolutional neural network (R-CNN) structure. Despite their distinct advantages in accuracy, two-stage detection algorithms suffer from slow detection speeds, making it difficult to meet the requirements of real-time detection.
In the field of object detection, one-stage algorithms simplify the detection task into a single forward propagation process, thereby avoiding the region proposal generation step. Consequently, while ensuring detection accuracy, they significantly improve detection efficiency. Representative one-stage detection algorithms such as YOLO (You Only Look Once) [23] and SSD (Single Shot MultiBox Detector) [24] have demonstrated significant advantages for foreign object detection in transportation scenarios. For instance, reference [25] proposed an improved SSD algorithm for multi-target detection in railway environments and developed a feature extraction convolution kernel composed of multi-scale Gabor and color Gabor to better extract image features and improve model detection accuracy. Reference [26] introduced a railway foreign object detection method based on Faster R-CNN. This approach cuts down model parameters by replacing fully connected layers with global average pooling and adds more anchor points to boost the accuracy of target region proposals. Experimental findings demonstrate that this approach enables precise identification of pedestrians, vehicles, and animals. Reference [27] developed a railway foreign body intrusion detection framework using an enhanced YOLOv4-tiny model. By integrating an attention mechanism and OSA modules into the RODNet backbone, it preserves detection accuracy while decreasing the model’s parameter count and computational overhead. Reference [28] presented an optimized YOLOv5s-based detection method for railway foreign objects, which boosts the network’s capability to extract both foreign objects and environmental features via a hybrid attention mechanism and a DW decoupled detection head. Reference [29] proposed an enhanced YOLOX-nano algorithm for railway detection, which adopts the CIoU loss function and the SA-NET lightweight attention mechanism, achieving favorable detection results in embedded platform tests.
Yan Ce [30] proposed an improved YOLOv7 algorithm for detecting foreign objects on railways. The method introduces the CARAFE operator to reduce feature loss and increase the receptive field, adopts GhostConv convolution to reduce computational complexity and parameter count, incorporates a global attention mechanism to enhance information interaction and expression capabilities, and uses the Alpha GIoU loss function to improve small object detection capability and model convergence speed. Although the optimized YOLOv7 model strikes a balance between detection accuracy and efficiency, the YOLOv7 model itself is relatively complex, leading to increased computation time. Despite the excellent performance of object detection algorithms in simulation experiments, factors such as the complex railway operating environment and insufficient computing power may lead to problems in practical railway foreign object detection applications. Among these, issues such as missed detection of small objects, insufficient detection speed, and difficulty in deployment on embedded devices urgently need to be addressed. Therefore, this paper aims to investigate solutions to enhance the effectiveness of railway perimeter foreign object intrusion detection methods.
In the current field of research, although deep learning-driven single-stage and two-stage object detection algorithms have achieved high detection accuracy in foreign object intrusion detection, these methods still present several limitations. Specifically, these limitations include high computational complexity, large resource requirements, and substantial model size. These issues pose significant challenges for practical applications, particularly in resource-constrained environments for online railway foreign object intrusion detection. Meanwhile, the YOLO series algorithms are highly regarded in the field of object detection. In particular, YOLOv8 has gained widespread recognition due to its superior accuracy, fast inference speed, and moderate model size. Against this backdrop, this research aims to propose an improved foreign object intrusion detection method built on the YOLOv8 algorithm. The main innovations of this study are as follows:
Firstly, this study designs a spatial coordinate attention module (SCAM) that integrates coordinate attention and spatial attention mechanisms. By combining coordinate information with spatial location data, this module adaptively recalibrates feature weights to reinforce the model’s focus on target regions. This design boosts the model’s capability to capture critical features, thereby elevating detection precision and enabling accurate identification of foreign objects in railway scenarios.
Secondly, this paper introduces a feature reuse-based cascaded dilated convolution (CDConv) module, which is embedded into the backbone network to enable multi-scale feature extraction without a substantial increase in parameter count. This design effectively enhances the detection accuracy of foreign objects on railway tracks.
Thirdly, this paper develops a more lightweight decoupled detection head, dubbed X6Detect. In contrast to the decoupled head in YOLOv6, X6Detect boasts a more streamlined architecture and reduced computational overhead while preserving high detection accuracy.
Finally, this paper constructs a self-collected foreign body intrusion dataset, which contains 4000 images. After screening, deduplication, and labeling, 3000 valid images are obtained and divided into three different categories.
Subsequent experimental validation demonstrates that the optimized YOLOv8 algorithm model can further fuse multi-scale features and conduct more detailed analysis of foreign objects, thus enabling the identification of target types and locations with greater accuracy while improving the model’s robustness and generalizability. This technology provides an innovative solution for the field of foreign object intrusion detection and is expected to significantly improve the efficiency of railway safety operations.
2. Related Work
2.1. Onboard Sensing Unit
As a core module for moving trains to acquire surrounding environmental information, the onboard sensing unit functions analogously to the “visual” and “auditory” systems of a vehicle. By integrating a variety of advanced sensors and supporting processing equipment, this unit perceives critical environmental data—such as road conditions, obstacles, and traffic signals—in real time. It performs preliminary fusion and processing of the raw information and subsequently transmits the results to the decision-making layer of the autonomous driving system. This provides crucial data support for vehicle path planning, behavioral decision-making, and motion control.
The configuration of the onboard host includes six types of boards: power supply, core processing, communication, input/output (I/O), switch, and backplane. Its supporting unit integrates sensing devices such as binocular cameras, LiDAR, and millimeter-wave radar and is equipped with a computing platform capable of data acquisition, preprocessing, and sensor fusion. This ensures the continuous, stable, and accurate acquisition of environmental information under all-weather conditions, varying lighting, and complex road scenarios. The physical appearance of the onboard sensing unit is illustrated in Figure 1.
The onboard sensing unit is deployed on the console behind the windshield. On the one hand, the windshield provides basic protection for the equipment, effectively reducing contamination of the sensor lenses by external dust and rain. This ensures the clarity of the captured images and avoids increased difficulty in algorithm processing due to excessive data noise. On the other hand, the console area offers an unobstructed and wide field of view, enabling the sensors to clearly capture environmental information in critical close-range regions ahead of and to the sides of the train. This setup completely covers core detection targets such as tracks, turnouts, and nearby obstacles, preventing the loss of algorithm training data caused by perception blind spots.
To further improve data quality, the equipment is rigidly connected to the console using a dedicated anti-vibration mounting bracket. A highly elastic shock-absorbing pad is installed at the base of the bracket to effectively filter high-frequency vibrations during train operation. This prevents sensor attitude deviation caused by vibration, thereby avoiding issues such as ghosting and positional offset in the collected data. This design directly guarantees the spatiotemporal consistency of the data required by the visual detection algorithm, serving as a crucial prerequisite for multi-frame data fusion and target trajectory tracking optimization.
2.2. YOLOv8 Algorithm Model
As the latest iteration of the YOLO series, YOLOv13 integrates Transformer and CNN architectures to implement a global attention mechanism, thereby enhancing perception capabilities in complex scenarios, and introduces self-supervised learning to reduce reliance on annotated data. However, the model’s substantial scale leads to high deployment costs and hinders real-time operation on edge devices. In contrast, YOLOv8 has garnered extensive application and rigorous validation, supported by mature community resources and technical documentation, which guarantees its reliability and stability in real-world research deployments. Taking into account model innovation, practical applicability, and application maturity, this research adopts YOLOv8 as the baseline framework.
YOLOv8 offers five models of different scales. Among these variants, YOLOv8n, as the most lightweight version, features fast detection speed and low resource consumption, rendering it well-suited for edge device deployment. Therefore, this paper selects YOLOv8n as the reference model for railway track foreign object detection. Composed of four functional modules, the network includes an input end, backbone, neck, and head. Its structure is illustrated in Figure 2.
Although the YOLOv8n algorithm model exhibits significant advantages in the field of real-time detection, directly applying it to railway track foreign object detection tasks still presents challenges in recognizing foreign objects of varying scales, particularly those of smaller sizes. To effectively address these challenges, specific adjustments and optimizations to the algorithm are necessary to enhance its capability to detect objects across different scales, thereby improving the comprehensive performance of railway track foreign object detection. This process is crucial for the timely identification and handling of potential safety hazards, ensuring the safety and smoothness of railway transportation.
3. Materials and Methods
The original YOLOv8n algorithm cannot fully satisfy the requirements of track foreign object detection; therefore, this paper improves the YOLOv8n algorithm in the following aspects. First, by integrating CDConv into the backbone network, the number of network parameters and computational complexity are reduced, while detection speed and accuracy are simultaneously improved. Second, this paper designs a lightweight decoupled head, named X6Detect, to replace the original head, enabling effective separation of classification and regression tasks. Finally, this paper designs the SCAM module, which substantially boosts the model’s capability to focus on critical target regions. The architecture of the enhanced network is presented in Figure 3.
3.1. Spatial Coordinate Attention Mechanism (SCAM)
As deep learning technology continues to evolve, a growing body of research has underscored the value of integrating attention mechanisms into neural network designs. The squeeze-and-excitation network (SE-Net) [31], a classic channel attention mechanism, has found widespread use across diverse convolutional neural network architectures. The convolutional block attention module (CBAM) [32] is widely regarded as a leading method in the field of spatial attention mechanisms; it fuses channel attention and spatial attention in a serial structure. The primary goal of the channel attention module is to adaptively recalibrate channel-wise feature responses, emphasizing information-rich channels. Its computation is given by Equation (1):
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \tag{1}$$
where σ represents the sigmoid activation function, MLP refers to the fully connected network layer, AvgPool and MaxPool correspond to average pooling and max-pooling operations, respectively, and F denotes the input feature map [33].
The spatial attention mechanism focuses on the spatial dimensions of the feature map, as shown in Figure 4. It helps the network attend to the most influential local features by highlighting the most salient elements within each channel. As illustrated in Figure 5, this method considers both feature similarities and their spatial arrangement. The computation of this attention module is given in Equation (2). In Figure 4, Figure 5 and Figure 6, the symbols C, W, and H represent the channel count, width, and height of the feature map, respectively.
The spatial attention module is computed as in Equation (2), which uses a 7 × 7 convolution kernel:
$$M_s(F) = \sigma\big(f^{7\times 7}(\mathrm{Concat}[\mathrm{Channel\_AvgPool}(F),\ \mathrm{Channel\_MaxPool}(F)])\big) \tag{2}$$
Here, $f^{7\times 7}$ denotes convolution with a 7 × 7 kernel, Concat denotes channel concatenation, Channel_MaxPool and Channel_AvgPool denote max-pooling and average pooling across the channel dimension, σ is the sigmoid activation function, and F is the input feature map. The output $M_s(F)$ is a map that assigns an importance weight to every spatial position, allowing the model to focus on the most critical areas.
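The two attention computations described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the reduction ratio of the two-layer MLP, its ReLU hidden layer, and the random weights are all assumptions introduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Channel weights in the style of Equation (1).

    F:  feature map of shape (C, H, W)
    W1: (C_hidden, C) and W2: (C, C_hidden), the MLP weights (assumed shapes).
    """
    avg = F.mean(axis=(1, 2))  # global average pooling -> (C,)
    mx = F.max(axis=(1, 2))    # global max pooling     -> (C,)

    def mlp(v):
        # Two-layer MLP with a ReLU hidden layer (an assumption of this sketch)
        return W2 @ np.maximum(W1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))  # (C,)

def conv2d_same(x, k):
    """Stride-1, zero-padded cross-correlation: x (Cin, H, W), k (Cin, kh, kw) -> (H, W)."""
    _, kh, kw = k.shape
    H, W = x.shape[1:]
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((H, W))
    for i in range(kh):
        for j in range(kw):
            out += np.einsum("c,chw->hw", k[:, i, j], xp[:, i:i + H, j:j + W])
    return out

def spatial_attention(F, kernel):
    """Spatial weights in the style of Equation (2); kernel has shape (2, 7, 7)."""
    avg = F.mean(axis=0, keepdims=True)           # pooling across channels -> (1, H, W)
    mx = F.max(axis=0, keepdims=True)             # (1, H, W)
    stacked = np.concatenate([avg, mx], axis=0)   # channel concatenation -> (2, H, W)
    return sigmoid(conv2d_same(stacked, kernel))  # (H, W)

# Usage with random weights: channel attention first, then spatial attention.
rng = np.random.default_rng(0)
C, H, W, r = 8, 16, 16, 4
feat = rng.standard_normal((C, H, W))
m_c = channel_attention(feat, rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)))
feat_c = feat * m_c[:, None, None]
m_s = spatial_attention(feat_c, 0.1 * rng.standard_normal((2, 7, 7)))
refined = feat_c * m_s[None, :, :]
```

Both attention maps lie in [0, 1] and only rescale the input, so the refined feature map keeps the shape of the original.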
Coordinate attention (CA) [34], depicted in Figure 4, is a channel attention mechanism that also encodes spatial position. It refines two-dimensional feature maps and captures long-range dependencies along each spatial axis by encoding features along both the X (width) and Y (height) directions. This preserves the spatial positional relationships that are often lost in conventional pooling-based aggregation methods. By incorporating CA, the model captures spatial details more effectively, improving its understanding of the broader context.
The coordinate attention mechanism constructs an attention vector for each (X,Y) coordinate of the input feature map, denoted X, whose dimensions are H, W, and C (height, width, and channel count, respectively). This attention vector, Attention(X,Y), takes into account the spatial location of the (X,Y) position, and its computation can be expressed as follows:
To extract the feature vector f(X,Y), the features at position (X,Y) in the feature map are first passed through two convolutional layers. Rather than using global average pooling across the entire feature map, 1D average pooling is applied separately along the X and Y axes.
Subsequently, the one-dimensional vectors obtained from both directions are concatenated and convolved to integrate their information. In the final stage, the merged vector is split into two parts that correspond to the height (H) and width (W) dimensions. This module improves the model’s ability to perceive input features along multiple directions, allowing it to better capture directional characteristics and thus boosting overall performance. The corresponding mathematical formulation, following the coordinate attention design of [34], is given below:
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i) \tag{4}$$
$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \tag{5}$$
$$f = \delta\big(F_1([z^h, z^w])\big) \tag{6}$$
$$g^h = \sigma\big(F_h(f^h)\big) \tag{7}$$
$$g^w = \sigma\big(F_w(f^w)\big) \tag{8}$$
In the aforementioned formulas, W represents the number of horizontal pixels, and H denotes the number of vertical pixels. In Formula (4), $x_c(h, i)$ indicates the pixel value of the c-th channel at coordinates $(h, i)$ in the input feature map. Its function is to accurately capture spatial dependencies in the vertical direction of the feature map, such as vertical positional characteristics of targets in railway foreign object detection. Formula (5) is similar to Formula (4) but differs in the direction of operation. In Formula (6), the $z^h$ and $z^w$ obtained from Formulas (4) and (5) are first spliced to generate a feature integrating bidirectional coordinate information, followed by a convolutional transformation $F_1$ and a nonlinear activation $\delta$ to enhance feature representation and prepare for subsequent attention weight calculations. In Formulas (7) and (8), the two parts $f^h$ and $f^w$ of the fused feature are first mapped by the convolutions $F_h$ and $F_w$ and then normalized by the sigmoid function σ. By assigning adaptive weights to each height or width position in the feature map, the model focuses on key vertical and horizontal regions. These five formulas constitute the core computational chain of the coordinate attention module. Through the “decoupling–fusion–mapping–normalization” steps, it addresses the issues of traditional spatial attention modules that model two-dimensional spatial information coarsely and neglect directional features, enabling the model to more accurately capture spatial positional characteristics of targets like railway foreign objects.
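The “decoupling–fusion–mapping–normalization” chain of Formulas (4) to (8) can be illustrated with a short NumPy sketch. It is a simplified reading of coordinate attention rather than the paper's code: the function name and weight shapes are hypothetical, the 1 × 1 convolutions are written as matrix products, ReLU stands in for the nonlinearity, and batch normalization is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x, W1, Wh, Ww):
    """Simplified coordinate attention.

    x:  feature map of shape (C, H, W)
    W1: (C_mid, C) shared 1x1 convolution applied after concatenation
    Wh: (C, C_mid) and Ww: (C, C_mid), 1x1 convolutions for the two directions
    """
    C, H, W = x.shape
    z_h = x.mean(axis=2)  # decoupling: average over width  -> (C, H), cf. Formula (4)
    z_w = x.mean(axis=1)  # decoupling: average over height -> (C, W), cf. Formula (5)

    z = np.concatenate([z_h, z_w], axis=1)  # fusion: (C, H + W)
    f = np.maximum(W1 @ z, 0.0)             # shared transform, cf. Formula (6); ReLU is an assumption

    f_h, f_w = f[:, :H], f[:, H:]           # split back into height and width parts
    g_h = sigmoid(Wh @ f_h)                 # mapping + normalization, cf. Formula (7) -> (C, H)
    g_w = sigmoid(Ww @ f_w)                 # mapping + normalization, cf. Formula (8) -> (C, W)

    # Each position (h, w) is rescaled by the product of its row and column weights.
    return x * g_h[:, :, None] * g_w[:, None, :]

# Usage with random weights
rng = np.random.default_rng(0)
C, H, W, C_mid = 8, 12, 10, 4
feat = rng.standard_normal((C, H, W))
out = coordinate_attention(
    feat,
    rng.standard_normal((C_mid, C)),
    rng.standard_normal((C, C_mid)),
    rng.standard_normal((C, C_mid)),
)
```

Because the row and column weights are computed from pooled strips rather than from a single global average, two positions in the same channel can receive different weights, which is how positional information is retained.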
The SCAM module combines the spatial-channel attention mechanism from CBAM with CA’s ability to preserve positional information. As shown in Figure 5, CA establishes long-range dependencies along both the X and Y axes, which provides the basic coverage of the receptive field. The spatial attention component then focuses on locally useful information on this basis. By merging spatial and channel filtering, SCAM captures both long-range dependencies and local fine-grained details, improving the model’s perceptual capacity and feature representation.
3.2. CDConv Module
Introduced by a Google research team in 2015, dilated convolution is designed to enlarge the receptive field of a neural network without increasing its parameter count. This technique was first integrated into the DeepLab model [35]. Through dilated convolution, the receptive field can be significantly enlarged even while keeping the kernel size constant. As illustrated in Figure 7, a standard 3 × 3 convolution kernel can simulate the effects of 5 × 5 and 7 × 7 kernels. Specifically, Figure 7a demonstrates the standard convolution process with a dilation rate of 1, covering an actual input area of 3 × 3. Figure 7b illustrates a case where the convolution kernel weights are spaced 1 pixel apart (pixel spacing 2), and the dilation rate is 2. With the same number of parameters, the actual input area expands to 5 × 5. Figure 7c shows the convolution kernel weights spaced 2 pixels apart (pixel spacing 3) and a dilation rate of 3. Under the same parameter constraints, the input area further expands to 7 × 7.
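The kernel sizes quoted above follow from the standard relation k_eff = k + (k − 1)(d − 1) for a k × k kernel with dilation rate d. The short sketch below checks this arithmetic and also shows how the receptive field grows when stride-1 dilated convolutions are stacked in series. The function names are illustrative, and the dilation sequence (1, 2, 3) is taken from Figure 7 as an example, not as the confirmed CDConv configuration.

```python
def effective_kernel_size(k: int, dilation: int) -> int:
    """Input extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

def stacked_receptive_field(k: int, dilations) -> int:
    """Receptive field of stride-1 k x k convolutions applied one after another."""
    rf = 1
    for d in dilations:
        rf += (k - 1) * d  # each layer widens the field by (k - 1) * d pixels
    return rf

for d in (1, 2, 3):
    size = effective_kernel_size(3, d)
    print(f"dilation {d}: covers {size} x {size} input pixels with 9 weights")

print("stacked (1, 2, 3):", stacked_receptive_field(3, (1, 2, 3)))
```

A single 3 × 3 kernel therefore covers 3 × 3, 5 × 5, and 7 × 7 regions at dilation rates 1, 2, and 3, while three such layers in series reach a 13 × 13 receptive field with only 27 weights per input-output channel pair.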
In the process of detecting railway track foreign objects, directly employing multiple dilated convolution layers can expand the receptive field, but this method usually leads to a substantial increase in parameter count. To address this limitation, this study proposes a module named CDConv. The module adopts a feature reuse mechanism and realizes parameter sharing by cascading multiple dilated convolution layers in series, which effectively reduces the consumption of computing resources. Furthermore, the CDConv module achieves multi-scale feature extraction for input foreign object images by constructing multiple parallel branches with different receptive fields. The architecture of the CDConv module is depicted in Figure 8.
Let the input feature map be denoted as x. Through three consecutive dilated convolution operations, we can obtain:
where D_i(·) denotes the i-th dilated convolution operator, used for feature extraction and receptive field expansion.
Subsequently, batch normalization (BN) and average pooling (AvgPool) layers are applied to the output of each dilated convolution. These operations are designed to optimize the quality and efficiency of feature extraction. Their mathematical expressions can be formulated as
where BN(·) and Avg(·) represent the batch normalization and average pooling operations. By concatenating the three normalized and pooled branch outputs, a multi-scale feature fusion layer can be constructed. This layer effectively integrates feature information extracted from convolution branches with different dilation rates, preserving detailed texture features while enhancing global contextual information. The mathematical expressions can be formulated as
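The displayed equations of this derivation did not survive in this copy. A reconstruction consistent with the surrounding definitions might read as follows. The intermediate symbols y_i, z_i, and F are assumed names rather than the authors’ notation, and the order BN followed by average pooling is inferred from the order in which the text lists the two operations.

```latex
\begin{align}
  y_1 &= D_1(x), \qquad y_2 = D_2(y_1), \qquad y_3 = D_3(y_2), \\
  z_i &= \mathrm{Avg}\bigl(\mathrm{BN}(y_i)\bigr), \qquad i = 1, 2, 3, \\
  F   &= \mathrm{Concat}(z_1, z_2, z_3).
\end{align}
```

The first line expresses the serial (feature reuse) structure, the second the per-branch normalization and pooling, and the third the multi-scale fusion by concatenation.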
The CDConv module is integrated into the YOLOv8n model. By configuring different dilation rates, it effectively expands the receptive field without significantly increasing parameter count, enabling the model to capture contextual information across broader regions. Meanwhile, the multi-scale parallel branches extract feature details of foreign objects in varying sizes. Additionally, the feature reuse mechanism in CDConv effectively suppresses parameter growth, ensuring performance improvements without substantially increasing computational load. This meets the real-time requirements of railway foreign object detection systems.
In subsequent experiments, this study replaced the first convolutional layer of the YOLOv8n backbone network with a CDConv module. The module reduces parameters through feature reuse and parameter sharing across the serial dilated convolutions. The subsequent increase in overall model parameters resulted from the integration of the SCAM module and the X6Detect detection head and is unrelated to the parameter optimization design of the CDConv module itself.
3.3. X6 Detection Decoupled Head
In the field of object detection, network architectures primarily employ two strategies: coupled heads and decoupled heads. The coupled head shares parameters between classification and regression tasks within a single convolutional kernel, optimizing both classification loss and localization loss during training. While this design offers structural simplicity and fewer parameters, it has inherent limitations: classification focuses on semantic features (e.g., object category attributes), whereas regression relies more on geometric features (e.g., object bounding box coordinates). These two tasks fundamentally differ in their feature requirements. When the coupled head processes both tasks simultaneously, gradients from different tasks interfere with each other, making it difficult to optimize both loss functions effectively. This issue becomes particularly critical in railway object detection scenarios requiring high localization accuracy, where problems like bounding box prediction deviations or mismatches between classification confidence and localization precision may occur.
By contrast, decoupled detection heads handle classification and regression through independent branches, each equipped with its own feature extraction module. While this design gives the model superior flexibility and modularity, it demands greater computational overhead and leads to a larger network scale, owing to the separate branches maintained for the regression and classification tasks. It is thus particularly important to design a lightweight decoupled network architecture with effective information fusion capabilities.
Building upon the decoupling strategy of YOLOv6, this study proposes the X6Detect decoupled head, which further reduces the number of parameters. Figure 9a,b illustrates the YOLOv6 (V6) detection head and the X6 detection head, respectively.
The core differences between X6Detect and the YOLOv6 detection head are reflected in channel count and layer structure. Firstly, X6Detect reduces the channel count from 256 to 128 through 1 × 1 convolution during the basic feature extraction stage, while the YOLOv6 detection head maintains 256 channels. Secondly, X6Detect eliminates two redundant 3 × 3 convolution layers in the YOLOv6 detection head and employs gradient fusion technology to enable parameter sharing at the lower feature level, unlike the fully decoupled architecture of YOLOv6. This reduces the parameter size from 12.89 M to 8.30 M, a 35.7% decrease. The shared feature representation mechanism in X6Detect promotes interaction between different tasks and facilitates the learning of shared features. Ultimately, this method achieves a dynamic balance between detection accuracy and detection efficiency.
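The effect of halving the channel count can be seen from the weight count of a single convolution. The sketch below is illustrative only: it counts the weights of generic convolution layers (bias and batch normalization omitted) and does not reproduce the actual X6Detect layer layout, which this section does not fully specify.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int, bias: bool = False) -> int:
    """Number of learnable weights in a standard 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


wide_3x3 = conv_params(3, 256, 256)    # 3 x 3 convolution kept at 256 channels
reduce_1x1 = conv_params(1, 256, 128)  # 1 x 1 convolution reducing 256 -> 128
narrow_3x3 = conv_params(3, 128, 128)  # 3 x 3 convolution at 128 channels

print(wide_3x3, reduce_1x1 + narrow_3x3)
```

Because the weight count scales with the product of input and output channels, a 3 × 3 layer at 128 channels needs a quarter of the weights of the same layer at 256 channels, and the added 1 × 1 reduction costs comparatively little.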
4. Experiments and Analysis
Building upon the aforementioned research, this section validates the performance of ACX-YOLOv8. In this part of the study, the experimental configurations are first outlined. Subsequently, the dataset and the evaluation metrics for assessing model performance are described in detail. Finally, the experimental results are analyzed in depth.
4.1. Experimental Environment
The details of the experimental setup for this study are presented in Table 1.
4.2. Experimental Evaluation Metrics
To accurately evaluate model performance, this study employed several common evaluation metrics for object detection models, including precision, recall, average precision (AP), mean average precision (mAP), frames per second (FPS), and parameters. The specific calculation formulas are as follows:
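The formulas themselves did not survive in this copy. Under the standard definitions of these metrics they would read as below. Here N (the number of object classes) and AP_i (the average precision of class i) are assumed symbol names; FPS and parameter count are omitted because they are measured rather than derived.

```latex
\begin{align}
  P   &= \frac{TP}{TP + FP}, \\
  R   &= \frac{TP}{TP + FN}, \\
  AP  &= \int_{0}^{1} P(R)\, \mathrm{d}R, \\
  mAP &= \frac{1}{N} \sum_{i=1}^{N} AP_i .
\end{align}
```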
The calculation of the aforementioned metrics is based on the confusion matrix, with the specific classification confusion matrix detailed in Table 2.
In the field of object detection, there may be various deviations between predicted values and ground truth values. Based on these deviations, detection results can be subdivided into four categories. Among these, true positive (TP) represents the number of samples that the model accurately predicts as positive; false positive (FP) represents the number of negative samples incorrectly predicted as positive; true negative (TN) represents the number of samples that the model accurately predicts as negative; and false negative (FN) represents the number of positive samples incorrectly predicted as negative.
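As a small worked example of how the counts defined above feed the metrics, the sketch below computes precision, recall, and a class-averaged AP. All numbers are hypothetical and are not taken from the paper’s tables.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of ground-truth positives that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_average_precision(ap_per_class):
    """Mean of per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical counts for one class: 80 correct detections,
# 20 spurious detections, 10 missed objects.
print(precision(80, 20), recall(80, 10))

# Hypothetical per-class AP values for a three-class dataset.
print(mean_average_precision([0.90, 0.80, 0.85]))
```

Note that TN does not enter either ratio, which is why precision and recall remain meaningful in detection, where the number of true negatives (background regions) is not well defined.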
4.3. Experimental Dataset
Given the scarcity of foreign object intrusion detection samples in the field of rail transportation, this study constructed a dataset named the Orbital Foreign Body Dataset (OFBD). An obstacle on the railway track is defined as any foreign object that intrudes into the railway building clearance. The dataset is derived from video data captured during actual train operations, comprising 4000 frames extracted from railway surveillance footage. After screening, deduplication, and annotation, 3000 valid images were obtained. The images were divided into a training set, validation set, and test set in an 8:1:1 ratio, a standard approach for small-sample computer vision tasks [36]. This proportion ensures the training set has sufficient samples for effective model fitting while maintaining adequate sample sizes in the validation and test sets, thereby supporting an objective and accurate evaluation of model performance. The OFBD includes both daytime and nighttime scenarios and primarily covers three categories: boxes, humans, and billboards. Sample images from the OFBD are shown in Figure 10. In this study, billboards are classified as foreign objects because they tend to intrude into the railway building clearance when overturned; like pedestrian intrusion, this may cause train collision accidents. The classification criteria are referenced from the “Railway Line Design Code” (TB 10098-2017) and relevant research on railway foreign object detection [37].
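The 8:1:1 division of the 3000 valid images can be sketched as follows. The paper does not state whether the split was random or which seed was used, so the shuffling, the seed, and the image identifiers below are assumptions made for illustration.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/val/test parts by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # assumed: random split with a fixed seed
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


image_ids = [f"img_{i:04d}" for i in range(3000)]  # hypothetical identifiers
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```

With 3000 images this yields 2400 training, 300 validation, and 300 test images, with no image shared between the subsets.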
The OFBD training set exhibits the following qualitative and quantitative characteristics. Scene diversity: It covers complex environments, including daytime, nighttime, and overcast conditions. Scale representation: It includes small, medium, and large targets. Annotation quality: The dataset is labeled in VOC format with pixel-level precision, and after review by three professionals, the annotation accuracy rate reaches 99.5%. These features demonstrate the dataset’s excellent diversity in terms of scene types, scale variations, and target categories, effectively supporting model training and ensuring practical application value. Furthermore, all data in this dataset are sourced from real-world scenarios, making it highly representative.
Furthermore, this study utilized PASCAL VOC 2012 and VOC 2007 [38] to create a new dataset for conducting supplementary experiments, thereby evaluating the effectiveness of the ACX-YOLOv8 model. Specifically, the validation set was derived from the VOC 2012 validation set, containing 5800 images; the test set was sourced from the VOC 2007 validation set, containing 2510 images. Through this dataset configuration, the model’s generalization capability is further examined to ensure its effectiveness across various scenarios.
4.4. Results and Analysis
4.4.1. Module Ablation Experiments
To validate the practical efficacy of algorithm optimization at each stage, this study designed an ablation experiment. The experiment aims to systematically assess the impact of the proposed improvement strategies on model performance through a controlled-variable methodology. The evaluation metrics employed comprise mean average precision (mAP), model parameter count, and frames per second (FPS). mAP serves as a key metric for assessing algorithm accuracy. The parameter count provides critical insights into model scale, which helps better reflect edge computing costs. FPS measures the model’s inference speed, thereby evaluating its real-time performance.
The outcomes of the ablation experiments are comprehensively presented in Table 3, which outlines various model variants, including A-YOLOv8, AC-YOLOv8, and ACX-YOLOv8.
In the A-YOLOv8 configuration, the spatial coordinate attention mechanism (SCAM) is integrated, whereas the AC-YOLOv8 configuration combines SCAM with CDConv. ACX-YOLOv8 applies all three components: SCAM, CDConv, and the X6 detection head.
As shown in Table 3, introducing the SCAM module first strengthens the model’s capability to learn inter-channel relationships, enabling the network to automatically identify the correlation and importance between different channels and thereby extract more discriminative feature representations. Experimental findings demonstrate that the model’s mAP increased by 0.4 percentage points, while the total number of parameters decreased by 0.05 M. This indicates that the proposed module can effectively enhance feature detection capability while reducing model complexity.
Secondly, the introduction of the CDConv module enables the model to achieve a broader receptive field, facilitating multi-scale feature extraction from images. Experimental data show that the mean average precision (mAP) improved by 0.7 percentage points while model parameters were reduced by 0.06 M. However, the parameter differences between models stem from modular integration. When deployed independently, the CDConv module reduces total model parameters by 0.05 M compared to YOLOv8n. Detailed analysis will be presented in the following sections.
Finally, in Experiment 4, after introducing the three modules mentioned above, the detection accuracy increased by 2.7% compared to Experiment 1, and the number of parameters increased by approximately 61%. The experimental results demonstrate that the improved YOLOv8n algorithm significantly improved detection speed and recognition rate, achieving efficient railway track foreign object detection.
Figure 11 provides a comparative visualization of the ablation experiments on the OFBD. Corresponding to the various module improvement strategies listed in Table 3, it more accurately demonstrates the effectiveness of each improved model on the railway track foreign object dataset.
In Figure 11, (a) shows the raw input image, (b) displays the detection performance of YOLOv8n, (c) outlines the results generated by A-YOLOv8, (d) details the outputs from AC-YOLOv8, and (e) illustrates the final detection results of ACX-YOLOv8.
Figure 12 depicts the precision and loss curves of each ablation model on the OFBD: (a) shows how mAP50 evolves throughout training, and (b) shows the change in validation loss over iterations. Compared with the YOLOv8n model, the proposed model reaches higher precision and converges faster, which verifies its effectiveness in object detection tasks.
4.4.2. Module Comparison Experiments
Integration of the X6 detection decoupled head further enhanced the model’s detection performance, with the mAP improved by 2.7%. Table 4 provides the comparison results of three different detection heads. Specifically, the “V6 detection head” denotes the decoupled head employed in YOLOv6, whereas the “decoupled detection head” corresponds to that utilized in YOLOX [39]. According to the comparative results presented in Table 4, the X6 detection decoupled head achieves a significant reduction in both the number of parameters and computational complexity, while retaining comparable accuracy, relative to the V6 detection head and the decoupled detection head.
To investigate the impact of differences in the placement and quantity of CDConv on the model’s detection performance, this study carried out a set of comparative experiments. In the experiment, different numbers of CDConv modules were integrated into various layers of the YOLOv8n model. During the experimental process, all other parameters were kept consistent except for the application position and quantity of CDConv. A detailed breakdown of the experimental results is provided in Table 5.
According to Table 5, embedding one CDConv module in the backbone structure resulted in superior model performance in terms of P, R, and mAP50. However, as the number of CDConv modules in the backbone increased to three, both precision and recall experienced a slight decline, and the average precision also diminished to a certain extent. When the number of CDConv modules was further increased to five, the average precision exhibited a further downward trend. In contrast, adding CDConv modules to the head structure—increasing from one to two—yielded negligible changes in P and R, and the improvement in mAP was not significant. Furthermore, this performance was inferior to that achieved by embedding a single CDConv module in the backbone structure. Comprehensive analysis indicates that embedding one CDConv module in the backbone structure has a notably positive effect on enhancing model detection performance. It maintains a high average precision while ensuring a certain level of precision and recall, thereby providing an effective strategy for model optimization. Based on these findings, this study proposes replacing the first convolutional layer of the backbone network with a CDConv module.
4.4.3. Model Comparison Experiments
To assess the detection capability of the enhanced YOLOv8n model, the study performed a comparative evaluation against a suite of state-of-the-art models, including YOLOv5, YOLOX, YOLOv7, YOLOv8, YOLOv9, YOLOv10, and YOLOv13. The comparative data on algorithm performance are detailed in Table 6.
In the experiment, all compared models adopted a unified training strategy, including input image size, number of training epochs, and batch size, to ensure the fairness and validity of the comparison.
With identical parameter settings, the detection performance comparison across various models is presented in Table 6. As shown in Table 6, the enhanced YOLOv8n model exhibits superior performance relative to the other models listed in Table 6. Specifically, the optimized model achieves a significant boost in accuracy compared to the baseline, while the increase in computational cost is relatively small. Furthermore, when compared to other models, the ACX-YOLOv8 model maintains a low number of parameters while also demonstrating a strong advantage in accuracy.
The visualization of detection results is shown in Figure 13, which intuitively presents the detection effects of the improved model and the original model. It can be clearly seen that the improved YOLOv8n model exhibits the best detection performance and highest accuracy, which further verifies the effectiveness of the improved algorithm.
4.4.4. Model Generalization Experiments
To assess the generalization capability of the railway track foreign object detection model, this study conducted experimental validation using the VOC dataset. The hyperparameter settings and training protocols were identical to those used for the OFBD. A detailed breakdown of the experimental results is provided in Table 7. Our findings show that with an IoU threshold of 0.5, the ACX-YOLOv8 model achieves superior performance compared to other baseline models.
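The role of the IoU threshold of 0.5 behind mAP50 can be illustrated with a minimal sketch. The box format (x_min, y_min, x_max, y_max) and the helper names are assumptions made here; a full evaluator would additionally require the predicted class to match and would assign each ground-truth box to at most one prediction.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def matches_at_threshold(pred_box, gt_box, threshold=0.5):
    """True if the prediction overlaps the ground truth enough to count at this threshold."""
    return iou(pred_box, gt_box) >= threshold


ground_truth = (10, 10, 50, 50)
print(matches_at_threshold((12, 12, 52, 52), ground_truth))  # small offset
print(matches_at_threshold((20, 20, 60, 60), ground_truth))  # larger offset
```

In this example the slightly shifted box has an IoU of about 0.82 and counts as a match, whereas the box shifted by 10 pixels has an IoU of about 0.39 and does not.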
Through analysis of the experimental results, we can conclude and verify that our enhanced model exhibits robust generalization ability across diverse tasks and datasets. These findings demonstrate that our model not only delivers outstanding performance in specific domains and datasets but also exhibits strong adaptability in novel environments, thus underscoring the broad applicability of the enhanced model.
4.5. Discussion
4.5.1. Model Performance Analysis
On the OFBD, ACX-YOLOv8 achieves 87.1% mAP50 and 68.2 FPS, outperforming YOLOv8n in both precision and real-time performance, demonstrating the effectiveness of our improvement strategy. Compared to advanced models like YOLOv9-tiny and YOLOv10n, ACX-YOLOv8 achieves higher detection accuracy with only 4.85 million parameters, highlighting the advantages of a lightweight design and making it more suitable for deployment in edge devices along railway tracks.
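To put the reported frame rate in terms of per-frame processing time, the conversion is sketched below. The camera frame rate of 25 FPS is a hypothetical value chosen for illustration; the paper does not state the frame rate of the surveillance footage.

```python
def frame_latency_ms(fps: float) -> float:
    """Average processing time per frame, in milliseconds, for a given throughput."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def keeps_up(model_fps: float, camera_fps: float) -> bool:
    """True if the model processes frames at least as fast as the camera delivers them."""
    return model_fps >= camera_fps


ASSUMED_CAMERA_FPS = 25.0  # hypothetical camera rate, not from the paper
print(f"{frame_latency_ms(68.2):.1f} ms per frame")
print(keeps_up(68.2, ASSUMED_CAMERA_FPS))
```

At 68.2 FPS the model spends roughly 14.7 ms per frame, which under the assumed camera rate leaves headroom within the 40 ms frame interval. Throughput measured on the experimental hardware may not carry over to an edge device.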
4.5.2. Module Effectiveness Analysis
Experimental results demonstrate that the integration of the SCAM module, CDConv module, and X6Detect detection head achieves a synergistic effect where 1 + 1 + 1 > 3. The SCAM module enhances target feature saliency, and the CDConv module expands the receptive field for multi-scale feature extraction, while the X6Detect detection head decouples classification and regression tasks. This tripartite collaboration effectively addresses challenges in railway track foreign object detection, including small target detection gaps, insufficient feature extraction, and excessive computational load.
4.5.3. Engineering Application Value
The engineering application value of the ACX-YOLOv8 model lies in its practical significance for railway track foreign object detection. First, its lightweight design (4.85 million parameters) makes it suitable for edge computing devices such as embedded systems along railway lines. It allows on-site real-time detection, cuts latency, and ensures timely response to hazards. Second, its high mAP50 values and satisfactory FPS on the OFBD and VOC datasets show good detection performance. It can reliably identify various foreign objects, which is vital for preventing railway accidents. In addition, the accuracy–efficiency balance enables the model to work continuously and stably in complex field conditions. This improves railway safety, cuts maintenance costs through proactive inspections, and boosts transportation efficiency. The model’s generalization ability extends its application to other industrial scenarios needing real-time, accurate, and efficient object detection with limited resources.
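Parameter count and storage size are related through the numeric precision of the weights. The sketch below estimates the weight storage for 4.85 million parameters at several precisions. It covers weights only (no activations or file-format overhead), uses 1 MB = 10^6 bytes, and the precisions listed are illustrative, since the paper does not state which one is used for deployment.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(num_params: float, dtype: str = "fp32") -> float:
    """Approximate storage for model weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(dtype, round(weight_size_mb(4.85e6, dtype), 2), "MB")
```

Under these assumptions the weights occupy about 19.4 MB in 32-bit floating point, 9.7 MB in 16-bit, and 4.85 MB in 8-bit integer form.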
4.5.4. Research Limitations
The limitations of this study are as follows: (1) the OFBD only includes three categories of foreign objects, excluding other common railway hazards such as animals and falling rocks; (2) the model is based solely on visual single-modal detection, and its performance under extreme weather conditions (e.g., heavy fog or blizzard) requires further improvement; (3) no long-term deployment tests were conducted in actual railway environments, and the model’s engineering stability needs further validation.
4.5.5. Future Research Directions
To address these limitations, future research directions include: (1) expanding the OFBD by adding more foreign object categories and sample sizes, such as animals and falling rocks; (2) integrating multimodal detection technologies (e.g., RGB-LIDAR and RGB–thermal fusion) to enhance the model’s performance in extreme environments; (3) conducting field deployment tests on actual railway sites to optimize the model’s engineering stability and robustness.
5. Conclusions
To address practical needs in railway track foreign object detection, this study proposes an improved algorithm, ACX-YOLOv8, based on YOLOv8. Through a series of experiments, the effectiveness of the model is validated, with the following key conclusions:
- The designed SCAM spatial coordinate attention module effectively enhances the model’s feature extraction capability for target regions, improving the model’s mAP50 by 0.4% while reducing parameters by 0.05 million.
- Building upon the foundation of SCAM, the proposed cascaded dilated convolution (CDConv) module replaces the first convolutional layer of the YOLOv8n backbone network, enabling multi-scale feature extraction. This method improves the model’s mAP50 by 0.7% while reducing the parameter count by 0.06 million.
- Building upon the foundation of SCAM + CDConv, the lightweight decoupled detection head developed in this paper achieves a 35.7% reduction in parameters compared to the YOLOv6 detection head, boosting the overall model’s mAP50 to 87.1% and surpassing YOLOv8 by 2.7%.
- The ACX-YOLOv8 model, with 4.85 million parameters, achieves an mAP50 of 87.1% and an FPS of 68.2 on the OFBD and an mAP50 of 92.2% on the VOC dataset, representing a 1.8% improvement over YOLOv8n. By balancing detection accuracy, real-time performance, and generalization capability, it provides an efficient and reliable technical solution for real-time detection of railway track foreign objects.
Notably, the improved model successfully identified railway track foreign objects at long distances, which indicates its ability to handle complex and difficult environmental conditions. This capability is of vital significance for railway operational safety: in actual operation, foreign objects far down the track are often difficult to detect in time, and the consequences can be severe if they affect train operations. By accurately identifying such objects, the improved model greatly enhances the reliability of railway safety monitoring. In future research, we will continue to refine this model and integrate multimodal technologies to provide more comprehensive and efficient safeguards for railway safety.
References
- 1. “14th Five-Year Plan” Railway Standardization Development Plan. Railw. Technol. Superv. 2022, 50, 4–7+20.
- 2. Liu L., Gou J. Research on Railway Obstacle Detection Based on YOLOv4. J. Railw. Sci. Eng. 2022, 19, 528–536. doi:10.19713/j.cnki.43-1423/u.t20210113
- 3. Shi H., Chai H., Wang Y., Yu Z. Research on Embedded Railway Foreign Body Intrusion Detection Algorithm Based on Target Recognition and Tracking. J. Railw. Sci. 2015, 37, 58–65.
- 4. Nakagawa D., Hatoko M. Reevaluation of Japanese high-speed rail construction: Recent situation of the north corridor Shinkansen and its way to completion. Transp. Policy 2007, 14, 150–164. doi:10.1016/j.tranpol.2006.11.004
- 5. Catalano A., Bruno F.A., Pisco M., Cutolo A., Cusano A. An intrusion detection system for the protection of railway assets using fiber Bragg grating sensors. Sensors 2014, 14, 18268–18285. doi:10.3390/s141018268. PMID: 25268920. PMCID: PMC4239953
- 6. Sun R. Application of fiber Bragg grating in the monitoring system for foreign object intrusion in high-speed railways. Inf. Commun. 2012, 6, 24–25.
- 7. Sinha D., Feroz F. Obstacle detection on railway tracks using vibration sensors and signal filtering using Bayesian analysis. IEEE Sens. J. 2015, 16, 642–649. doi:10.1109/JSEN.2015.2490247
- 8. Liu K., Li L., Tan F. A Design of Intelligent Foreign Object Intrusion Detection System in Subway Station Track Area. In Proceedings of the Sixth International Conference on Transportation Engineering; American Society of Civil Engineers: Reston, VA, USA, 2019; pp. 1092–1097.
