Enhanced intelligent train operation algorithms for metro train based on expert system and deep reinforcement learning
Yunhu Huang, Wenzhu Lai, Dewang Chen, Geng Lin, Jiateng Yin, Qing-Chang Lu

TL;DR
This paper introduces enhanced intelligent train operation algorithms that combine expert systems and deep reinforcement learning to improve energy efficiency and passenger comfort in metro trains.
Contribution
The novel contribution is the integration of expert systems with PPO-based reinforcement learning to optimize train operation without relying on offline speed profiles.
Findings
EITO algorithms outperform existing intelligent and manual driving methods in energy consumption and passenger comfort.
The EITOP algorithm shows the best performance in tests involving complex track conditions.
The proposed DMTD method increases coasting distances and reduces energy use.
Abstract
In recent decades, automatic train operation (ATO) systems have been gradually adopted by many metro systems, primarily due to their cost-effectiveness and practicality. However, a critical examination reveals computational constraints, adaptability to unforeseen conditions and multi-objective balancing that our research aims to address. In this paper, expert knowledge is combined with deep reinforcement learning algorithm (Proximal Policy Optimization, PPO) and two enhanced intelligent train operation algorithms (EITO) are proposed. The first algorithm, EITOE, is based on an expert system containing expert rules and a heuristic expert inference method. On the basis of EITOE, we propose EITOP algorithm using the PPO algorithm to optimize multiple objectives by designing reinforcement learning strategies, rewards, and value functions. We also develop the double minimal-time distribution…
Funding
- Fujian Provinces Education Research Project for Young and Middle-aged Teachers
- Minjiang University Talent Introduction Research Project
- Innovation Star Talent Program of the Third Batch in Fujian Province
- Special Fund for Education and Scientific Research of Fujian Provincial Department of Finance
- Scientific Research Foundation of Fujian University of Technology
- The Natural Science Foundation of Fujian Province
- Science and Education Joint Special Project of Minjiang University (Science and Engineering Category)
- 2023 National College Students' Innovation and Entrepreneurship Training Program
- The Fujian Provincial Social Science Foundation
1 Introduction
In recent years, with the acceleration of urbanization, urban road traffic resources can no longer meet the growing traffic demand, and the Intelligent Transportation System (ITS) [1] has emerged in response. The automatic train operation (ATO) system [2], which replaces manual driving in many places thanks to its low cost and automation, has become an important part of ITS. While ATO systems have been increasingly embraced by many metro systems over the past decades due to their low cost and practicality, they fall short in several critical areas. Firstly, the intelligence of these systems is limited; they often rely on predefined operational strategies and lack the dynamic adaptability to respond effectively to complex and unforeseen circumstances. Secondly, the absence of self-learning capabilities restricts their potential to improve efficiency and safety over time through the accumulation and analysis of operational data. Lastly, the generalization of these systems is constrained; they are typically tailored to specific lines and struggle to adapt to diverse line conditions, such as varying speed limits, gradients, and traffic flows, which limits their broader application. These limitations underscore the need for more advanced, intelligent, and adaptable train operation systems that can enhance operational efficiency, safety, and passenger comfort.
The speed control of metro train operation can be represented as a multi-objective optimization problem with constraints. In order to satisfy these constraints and optimize the objectives, the train must make driving decisions based on real-time information. Under normal conditions, the ATO is responsible for all train traction and braking control commands to make the train run on time, regulate its speed and stop exactly at its destination [3]. ATO is traditionally divided into two sub-modules. The first one is dedicated to the calculation of the speed profile of the future train operation. Under this module, offline optimization algorithms are used to calculate the optimal speed profile in terms of performance and energy consumption. The second sub-module works mainly to ensure that the train accurately tracks the given speed profile.
Recently, many studies have been devoted to designing an offline optimized train trajectory to improve energy efficiency. For example, Khmelnitsky [4] devised a numerical algorithm to obtain the best velocity profile, taking into account variable gradients and arbitrary speed limits. Furthermore, train operation issues encompass a variety of additional factors, such as trip comfort and punctuality. Yang et al. [5] created a genetic algorithm based on binary coding, developing a two-objective integer programming model with headway time control and dwell time management to find the optimal solution in terms of energy savings and service quality. Wang et al. [6] introduced a new iterative convex planning (ICP) technique to solve the train scheduling problem and obtain the ideal departure time, running time, and dwell time that minimize travel time and energy consumption. Using optimal speed trajectory searching methodologies under diverse track parameters, Guan et al. [7] created a multi-objective optimization model for the speed trajectory, with energy consumption and travel time as the key optimization objectives. With the development of artificial intelligence, many intelligent algorithms have been applied to train operation. Akba et al. [8] employed an artificial neural network with a genetic algorithm to optimize the coasting points of the velocity-distance trajectory and obtain minimum energy expenditure for a given travel time. Yang et al. [9] combined a simulation-based approach and a genetic algorithm to find an approximately optimal coasting control strategy. Yin et al. [10] developed the ITOR algorithm for intelligent train operation, which satisfies multiple objectives by using expert experience and the Q-Learning algorithm. Zhang et al. [11] used manual driving data to train three well-known algorithms (K-NN, Bagging CART, and Adaboost CART) to predict the driver's output control. Recently, Zhou et al. [12] proposed the STO algorithms, using the deep deterministic policy gradient (DDPG) and normalized advantage function (NAF) algorithms to further optimize energy consumption and comfort during train operation.
After generating the optimal recommended speed profile, the ATO's task is to develop an efficient method to control the train across different train models and operating conditions (e.g., tunnels, curves, steep gradients) so that the train can accurately track the speed profile and operate safely and smoothly. Ke et al. [13] proposed a fuzzy PID gain method to track the recommended speed profile, which was optimally generated by the MAX-MIN ant system. Song et al. [14] investigated the consequences of time-varying failures in both the traction and braking phases of the train, and suggested an adaptive backstepping control system that was completely parameter-dependent and successful in achieving good speed tracking performance. Liu et al. [15] proposed a high-speed railway control system based on a fuzzy control method and designed the control system in MATLAB. Gu et al. [16] proposed a new energy-efficient train operation model based on real-time traffic information from a geometric and topographic perspective. Two robust adaptive control approaches considering actuator saturation and unknown system parameters were proposed by Gao et al. [17]. Recently, Pu et al. [18] proposed a model-free adaptive speed controller based on neural network (NN) and PID algorithms, and showed through numerical experiments and real-line applications that the proposed algorithms track the SD trajectory precisely.
Previous research and applications have greatly improved the operational performance of metro trains. However, some basic problems remain unsolved and hinder the development of ATO systems. Firstly, most existing ATO systems pursue their train operation goals by treating energy-efficient trajectory calculation, real-time tracking methods, and station parking algorithms separately. In particular, ATO algorithms are designed to track offline optimized speed profiles and therefore lack intelligence, flexibility and robustness. Few studies have comprehensively considered multiple objectives such as driving comfort, punctuality, parking accuracy, and energy consumption. Meanwhile, complex control methods are difficult to implement in real operation when faced with system non-linearity, unknown resistance and variable in-train forces. Secondly, modern metro trains are capable of outputting continuous traction and braking forces, but few studies have designed continuous control models that consider complex line conditions. For example, the intelligent train operation algorithms based on reinforcement learning (ITOR) proposed in [10] can only achieve discrete control of the train under simple line conditions. Finally, some metro sections have more complex speed limits and gradient changes, whereas most proposed train models only consider operation in intervals with simple speed limit and gradient conditions. For example, the smart train operation (STO) algorithm based on the normalized advantage function (STON) proposed in [12] is difficult to apply to long distances between two consecutive stations and to lines with complex speed limits.
Facing these problems, new intelligent driving algorithms with a higher level of intelligence need to be investigated; in this paper they are called enhanced intelligent train operation algorithms (EITOE and EITOP). On the one hand, experienced drivers, drawing on their long-term accumulated maneuvering experience, can control the train effectively in real time so that the train operation meets the requirements of several control objectives. Besides, they adapt well to different railroad line conditions. On the other hand, reinforcement learning (RL) has been used as a powerful decision tool [19] to tackle optimal control problems in many domains, such as micro-drone control [20] and robot control [21], and has achieved good results in the field of intelligent train driving [10,12]. Meanwhile, deep reinforcement learning [22] is considered useful for the control of continuous movements [23]; a detailed discussion is given in Sect 3.2.
Therefore, we combine expert (experienced driver) experience with deep reinforcement learning algorithms to achieve better and more intelligent operation. The necessity of proposing both EITOE and EITOP lies in their complementary strengths. EITOE leverages expert knowledge and heuristic rules to provide a robust baseline for intelligent train operation, while EITOP uses deep reinforcement learning (PPO) to optimize multiple objectives dynamically. By presenting both algorithms, we aim to demonstrate how expert knowledge can be effectively integrated with advanced machine learning techniques to enhance train operation performance. This dual approach allows for a comprehensive evaluation of their effectiveness under varying conditions, showcasing the versatility and adaptability of our proposed solutions. Based on the above analysis, the contributions of this paper are as follows:
1) Integration of Expert Knowledge with Deep Reinforcement Learning: We introduce a novel approach that integrates expert system-based rules, distilled from experienced drivers, with the Proximal Policy Optimization (PPO) algorithm. This integration results in the development of the EITOE and EITOP algorithms, which not only provide a robust operational baseline for intelligent train operation but also dynamically optimize multiple objectives, enhancing the adaptability and efficiency of train control systems.
2) Development of the EITOE Algorithm: The EITOE algorithm is developed by encapsulating heuristic rules and inference methods from expert drivers within an expert system framework. This innovation allows for the generation of control strategies independent of offline speed profiles, thereby offering a flexible and adaptive operational approach that is responsive to real-time train operation requirements.
3) EITOP Algorithm for Multi-Objective Optimization: Extending the capabilities of EITOE, the EITOP algorithm utilizes PPO to optimize key operational objectives including safety, punctuality, energy efficiency, and passenger comfort. A significant contribution of EITOP is its real-time adjustment of acceleration and braking strategies based on current train conditions and speed limits, which is crucial for maintaining energy efficiency and punctuality in metro train operations.
The rest of the paper is organized as follows. In Sect 2, we define the necessary mathematical notation and performance indicators for metro train operation, and then describe the problem of metro train operation. Sect 3 presents the design of the EITO algorithms based on the expert system and PPO. In Sect 4, we construct an EITO simulation platform and give three numerical examples based on real data from the Yizhuang Line of the Beijing Subway (YLBS). We conclude the paper in Sect 5.
2 Problem formulation and objectives
2.1 Problem statement
This section first formulates the train operation problem and then clearly states the objectives that the proposed algorithms aim to achieve.
The train control problem is formulated as an optimal control problem, focusing on finding an optimal control strategy for the traction and braking force during the travel time. First, the minimum time interval and the travel time of trains are defined as Eqs (1) and (2) respectively:
For , total travel time T is defined as:
where the initial run time is t0 = 0(s), and the minimum time interval is .
The train motion model, which incorporates a multi-point mass and signal coordinate model, is used to simulate the electric multiple unit (EMU) of the train. This model takes into account the interaction effects between vehicles, offering an advantage over the traditional single-point train model. The model is expressed as:
where M denotes the weight of the EMU and mi denotes the weight of the i-th vehicle. denotes the interaction between vehicles [24]. denotes the force of each vehicle in the moving train, and is the distribution constant that determines the acceleration/braking force of the i-th vehicle. denotes the variation of the spring deformation of the coupler. denotes the drag force. In addition, describes the drag force caused by friction, and the are vehicle-specific factors. fc is the curve drag force defined as fc = 6.3M/(r(s)−55), where r(s) is the radius of the curve [11]. is the drag force caused by the gradient, and is the gradient angle.
The multi-unit model in Eq (3) captures inter-vehicle dynamics (e.g., coupler forces, mass distribution) to better simulate real-world EMUs. While the control force u is centralized, its distribution across vehicles is governed by the force allocation constants (Sect 2.1). For simplicity, we assumed uniform distribution in simulations, as fine-grained force allocation is hardware-dependent and beyond this paper’s scope.
Furthermore, the train acceleration (or braking) system in this study is nonlinear and subject to time delay. The transfer function of the simulated acceleration/braking system is:
where G(s) means the actual output, is the system performance gain, and means the delay and time constant of the train acceleration/braking model, respectively.
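To give a feel for how a gain, a delay and a time constant shape the actuator response, the following minimal sketch simulates a first-order lag with dead time in discrete time. The first-order-plus-dead-time form, the function name `simulate_actuator`, and every numeric value are assumptions made for illustration; they are not taken from Eq (4).

```python
from collections import deque


def simulate_actuator(commands, gain=1.0, delay=0.4, time_constant=0.4, dt=0.1):
    """Respond to a command sequence with a first-order lag plus dead time.

    Assumed form (not confirmed by the paper):
        G(s) = gain * exp(-delay * s) / (time_constant * s + 1)
    """
    delay_steps = int(round(delay / dt))
    # Commands that have been issued but are not yet felt by the train.
    pending = deque([0.0] * delay_steps)
    output, outputs = 0.0, []
    for u in commands:
        pending.append(u)
        delayed_u = pending.popleft()
        # Forward-Euler step of: time_constant * dy/dt + y = gain * delayed_u
        output += dt / time_constant * (gain * delayed_u - output)
        outputs.append(output)
    return outputs


# Unit step command: the output stays at zero during the dead time,
# then rises smoothly towards the commanded value.
response = simulate_actuator([1.0] * 50)
```

With these placeholder values the first four samples are zero (the dead time), after which the output approaches the command exponentially. A controller that ignores this lag tends to brake too late, which is one reason a safety margin on speed is useful.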
Metro train operation control models are generally evaluated in terms of five aspects: safety, punctuality, energy consumption, passenger comfort, and parking accuracy.
Safety: There may be multiple speed limit points between two consecutive metro stations, as shown in Fig 1, where , and are the speed limits for different sections between the two stations. That means during the travel period, the speed of the train must be lower than the current speed limit of the railroad section to ensure safety. The safety evaluation index Is is defined as:
Note that Eq (5) is intended to ensure that the train's speed remains within the designated limits. The evaluation index Is enforces the speed limits: if the speed exceeds the limit, the index reflects a violation, thereby discouraging overspeeding.
Punctuality: Punctuality is an important indicator of metro train operation that affects passenger interchanges and the entire schedule. We first define the running time error as:
where Ta is the actual running time of the train and Tp is the planned trip time of the train. In this paper, if the running time error is greater than , the metro is not running on time. Therefore, the punctuality evaluation index It is defined as:
Energy efficiency: Energy consumption accounts for a large portion of train operating costs. The energy consumed is described as
and the unit mass energy efficiency evaluation index between the two stations is defined as:
Comfort: Comfort is a direct evaluation criterion for the quality of train service; it requires that the instantaneous change in acceleration or deceleration stay below a certain threshold value. We define the rate of change of the acceleration u as:
Therefore, the ride comfort evaluation index Ic can be defined as:
where is the threshold for acceleration change.
Parking accuracy: The parking error is used to assess the parking accuracy, expressed as:
where sD is the length of the segment between adjacent stations and si is the current running distance of the train. Note that the parking error of the metro is generally required to be within [21] so that metro barrier doors can be opened. Therefore the parking accuracy index can be defined as:
Fig 1. Speed limits.
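The five evaluation aspects above can be checked on a logged trip with a few lines of code. The sketch below is only an illustration under stated assumptions: the function name `evaluate_run`, the simple traction-only energy model, and the default tolerances `jerk_tol` and `stop_tol` are hypothetical, while the 3 s punctuality tolerance follows the on-schedule criterion used later in the text.

```python
def evaluate_run(speeds, limits, controls, positions, actual_time, planned_time,
                 segment_length, dt, time_tol=3.0, jerk_tol=0.3, stop_tol=0.3):
    """Return pass/fail flags and scalar metrics for one simulated trip.

    jerk_tol and stop_tol are illustrative defaults, not the paper's values.
    """
    # Safety: the speed never exceeds the limit that applies at that sample.
    safe = all(v <= v_lim for v, v_lim in zip(speeds, limits))
    # Punctuality: running time error within the tolerance.
    punctual = abs(actual_time - planned_time) <= time_tol
    # Assumed energy model: only traction (u > 0) consumes energy, per unit mass.
    energy = sum(max(u, 0.0) * v * dt for u, v in zip(controls, speeds))
    # Comfort: largest step-to-step change of the control output.
    max_du = max((abs(b - a) for a, b in zip(controls, controls[1:])), default=0.0)
    comfortable = max_du <= jerk_tol
    # Parking accuracy: distance between the stop point and the final position.
    stop_error = segment_length - positions[-1]
    accurate = abs(stop_error) <= stop_tol
    return {"safe": safe, "punctual": punctual, "energy": energy,
            "max_du": max_du, "comfortable": comfortable,
            "stop_error": stop_error, "accurate": accurate}


report = evaluate_run(
    speeds=[0.0, 5.0, 10.0, 5.0, 0.0], limits=[20.0] * 5,
    controls=[0.5, 0.5, 0.0, -0.5, -0.5], positions=[0.0, 5.0, 15.0, 20.0, 20.1],
    actual_time=101.0, planned_time=100.0, segment_length=20.0, dt=1.0)
```

In this toy trip the train is safe, punctual and stops accurately, but the control output jumps by 0.5 between two samples, so the comfort check fails. Keeping the indices separate like this makes it easy to see which objective a given control strategy violates.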
2.2 Problem objectives
This section articulates the problems that this study aims to address. The two proposed EITO algorithms (EITOE and EITOP) aim to achieve the following objectives, which correspond to the above-mentioned problems:
Meeting multi-objective requirements: The EITO algorithms should be able to provide control strategies for traction and braking forces that can meet the requirements of multiple objectives such as safety, comfort, punctuality, parking accuracy, and energy efficiency of metro operation. Given the definitions of safety (Is), punctuality (It), energy efficiency (Ie), comfort (Ic), and parking accuracy (Ip), the algorithms need to ensure that the train operation satisfies all these evaluation indices simultaneously.
Independent of offline speed profile and continuous force control: The EITO algorithms should be able to perform normal operations without considering the speed distribution of the offline design and achieve the control of continuous forces. As existing ATO systems mainly rely on offline-designed speed profiles and current intelligent driving algorithms have limitations in continuous force control, the EITO algorithms aim to overcome these drawbacks.
Outperforming existing methods in energy efficiency and comfort: The control strategy output by the EITO algorithms should outperform experienced metro drivers and current intelligent driving algorithms in terms of energy efficiency while ensuring good ride comfort. By comparing with manual driving and existing intelligent driving algorithms (such as ITOR and STON), the EITO algorithms should achieve lower energy consumption (Ie) and better comfort (Ic) performance.
Adapting to different situations: The EITO algorithms should be able to flexibly adapt to different situations, including different trip times, different temporary faults (earlier or later arrival), speed limits, and gradient conditions (simple or complex). Considering the complex and variable operating conditions of metro trains, the algorithms need to adjust their control strategies accordingly to ensure stable and efficient operation.
Existing ATO systems must track the designed offline speed profile, and current intelligent driving algorithms either cannot achieve control of continuous forces or cannot adapt to complex and variable line conditions, which is the driving force behind this paper. Moreover, RL has been applied in many fields to deal with model-free problems [24], and expert knowledge has been widely used to improve control strategies [10],[11]. Therefore, in this paper, two intelligent algorithms are proposed, namely EITOE and EITOP, where EITOE is a heuristic algorithm based on an expert system to address multiple performance objectives of metro train operation. In addition, we develop EITOP based on EITOE using PPO to comprehensively optimize the multi-objective requirements of safety, comfort, punctuality, parking accuracy, and energy efficiency.
Following the problem statement outlined above, the next section will provide a detailed introduction to the specific control models and methodologies employed to achieve these objectives.
3 EITO algorithm design
The application of expert experience-based control methods to automatic train operation control is motivated by the following two reasons. On the one hand, the train operation control system is a highly complex, multiobjective nonlinear dynamical system [23,25], which poses great difficulties for traditional control methods that require a precise mathematical model. On the other hand, experienced drivers, drawing on their long-term accumulated maneuvering experience, can control the train effectively in real time so that the train operation meets the requirements of several control objectives [26].
Therefore, we first developed an expert system-based algorithm. This expert system contains expert rules and a heuristic inference system. These expert rules were summarized from our communication with metro drivers and from an analysis of YLBS data and the literature. In addition, based on the driver's operating strategy, we developed a heuristic inference method that works without an offline speed profile reference. Then, the appropriate EITOE output is obtained by combining the speed limit and the current state of the train.
Both EITOE and EITOP ensure punctuality through real-time adjustments based on current train conditions and speed limits. EITOE uses expert rules to allocate trip times effectively, while EITOP employs reinforcement learning to dynamically adjust acceleration and braking strategies. The algorithms continuously monitor the train's position and speed, allowing them to make timely decisions that keep the train on schedule. Specifically, the reward function in EITOP penalizes deviations from planned trip times, reinforcing behaviors that promote punctuality.
3.1 EITOE
This research adhered to strict ethical standards. In data collection (human or animal), we followed relevant regulations and obtained necessary consents. Experienced drivers can meet multiple objectives well. By observing the driver’s behavior, we found that an experienced driver can control the train in the correct position, allocate the reserved time reasonably, avoid unnecessary braking, limit the train speed to prevent over-speeding, and reduce the number of switches in the controller output. Based on the study of [10],[12], we derived IF-THEN rules using position, speed, and running time as inputs and acceleration/braking rate as outputs. These rules can be described as follows.
- Energy-efficient trains operate in three states, namely acceleration, coasting, and braking. The train does not transition directly from the acceleration state to the braking state and vice versa, unless a special incident is encountered. Transfer between any other two states is allowed.
- The acceleration of the train starting process should be appropriate for comfort (usually less than 0.6).
- For better comfort, the rate of change of acceleration in each time interval should not be too large (usually less than 0.3).
- Determine the next operation mode in advance according to the current speed and the next speed limit value to avoid triggering automatic train protection.
- Allocate the total trip time to each interval according to the speed limit, and try to operate according to the allocated time in each interval.
Fig 2. DMTD algorithm.
As mentioned before, experienced drivers consider the train's reserved time, reserved distance, speed limit, and current speed: if the train's speed is too low to arrive in time, the train accelerates. Conversely, if the train's speed is too high, the train coasts. We designed a data-driven inference method, DMTD, to determine the coasting or accelerating time. This inference method mimics manual driving and applies the MTD twice to calculate the desired speed range for the current speed limit interval. As shown in Fig 3, the speed range calculated by this method enables energy-efficient driving by making the train coast as much as possible while ensuring punctuality.
Fig 3. EITOE speed distance profile.
Using the online data of the train, we first use the DMTD algorithm (see Algorithm 1) to obtain the appropriately reserved trip times and in the current speed limit interval. Then, the estimated velocity range for the speed limit of each segment is calculated from the formula below .
Algorithm 1. DMTD algorithm.
1: Get online and offline data including current train location si, speed (point in Fig 2) and speed limit . Reserve travel time , assuming , which means that the train is already in the speed limit interval .
2: Every time the train enters a speed limit interval, as shown in the red dot of Fig 2 (indicating that the train enters the second speed limit interval), we draw the maximum traction speed curve from the train position and each speed limit section. Then, from the left end of each speed limit segment, the maximum brake speed curve is drawn to obtain the minimum travel time curve.
3: Calculate the minimum reserved time from the minimum travel time curve between the current position Si and leaving the current speed limit interval and reaching the destination.
4: Calculate the reserve time for the current speed limit interval .
5: Return .
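Step 2 of Algorithm 1 builds a minimum travel time curve from a maximum-traction curve and maximum-braking curves. A common way to compute such a curve is a forward and backward pass over a distance grid, sketched below. The function name `minimum_time_profile`, the constant acceleration and deceleration rates, and the grid layout are assumptions for illustration; the reserve-time allocation of steps 3 and 4 is not reproduced here.

```python
import math


def minimum_time_profile(limits, ds, a_max, b_max, v_start=0.0, v_end=0.0):
    """Fastest admissible speed at each grid point and the resulting trip time.

    limits[k] is the speed limit on the k-th grid cell of length ds, so the
    profile has len(limits) + 1 points. Constant a_max / b_max are assumptions.
    """
    n = len(limits)
    # Speed cap at each grid point: the lower of the two adjacent cell limits.
    cap = [limits[0]] + [min(limits[k - 1], limits[k]) for k in range(1, n)]
    cap.append(limits[-1])
    v = cap[:]
    v[0] = min(v[0], v_start)
    # Forward pass: maximum-traction curve starting from the current position.
    for k in range(1, n + 1):
        v[k] = min(v[k], math.sqrt(v[k - 1] ** 2 + 2.0 * a_max * ds))
    v[n] = min(v[n], v_end)
    # Backward pass: maximum-braking curve into each lower limit and the stop.
    for k in range(n - 1, -1, -1):
        v[k] = min(v[k], math.sqrt(v[k + 1] ** 2 + 2.0 * b_max * ds))
    # Time per cell from the mean of its end-point speeds (constant acceleration).
    total_time = sum(2.0 * ds / (v[k] + v[k + 1])
                     for k in range(n) if v[k] + v[k + 1] > 0.0)
    return v, total_time


# Two speed-limit intervals of 1000 m each (20 m/s, then 10 m/s), 10 m grid.
profile, t_min = minimum_time_profile([20.0] * 100 + [10.0] * 100,
                                      ds=10.0, a_max=1.0, b_max=1.0)
```

The difference between the planned trip time and a minimum time computed this way is the slack that can be spent on coasting, which is the quantity the DMTD method distributes over the speed limit intervals.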
The DMTD algorithm and Eq (14) indicate that if the train is between , the train can reach its destination on time (within 3 s of the planned trip time is on schedule, setting T0=3 in the Algorithm 1). Therefore, if the train’s speed is lower than , the train needs to accelerate; if the train’s speed is higher than , the train should coast. Then, the reasoning for determining the mode of operation (coasting or accelerating) is summarized as follows.
- If , the train should accelerate, and the output of the expert system is defined as Eq (15):
where is the maximum acceleration, and is the rate of variation of the acceleration in the time interval . Note that the parameter is set as according to expert experience. If unknown disturbances are considered, such as the resistance and gradient of the line, is not constant, and the value of this parameter will be adjusted by PPO in the next section.
- If , the train should coast and the output of the expert system can be described as Eq (16):
In addition, when the speed limit of the next section is less than the speed limit of the current section, as shown in Fig 3, the train may need to brake at a reasonable speed to ensure the safety of the train. In other words, the speed of the train should always be lower than the speed limit. In this case, we define the safe speed to monitor the speed of the train:
where si is the current position of the train. is the starting position of the next section. is the speed scaling factor caused by the time delay and friction of the railroad, which is taken as 0.95 in this paper. is the maximum deceleration speed. In this paper, . When the train operates to the position indicated by the mark 1 in Fig 3, i.e., when the current speed is higher than or equal to , then the train should immediately apply the maximum deceleration . In addition, if the length of the current speed limit interval is long enough, there will be a situation where exceeds the speed limit, which may cause the train to run beyond the speed limit (shown in Fig 3, marker 2.), so we redefine the safe speed to ensure safe driving:
We defined the parking accuracy (parking error less than ) in Eq (13). All three automatic stop control algorithms (TASC) proposed in our previous work can achieve accurate parking of trains; see [25] for the details of TASC. Therefore, we apply the heuristic online learning algorithm (HOA) of TASC at the location shown in Fig 3, mark 3, to ensure the parking accuracy of the train. With the expert rules and heuristic inference methods designed, the expert system of EITO is complete. As illustrated in Fig 3, EITOE can make appropriate acceleration, coasting, or deceleration decisions based on online and offline data such as speed limits and gradients, as well as expert reasoning methods, and its speed profile can be divided into an acceleration phase, multiple coasting phases, a safety braking phase, and a parking phase. Furthermore, the output is constrained by expert criteria to assure comfort and punctuality.
However, EITOE cannot optimize energy consumption online, because the rate of variation of the acceleration is specified as a constant value. Therefore, PPO is introduced to improve the performance of EITOE.
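The mode-selection reasoning of this section (brake when the safe speed is reached, accelerate below the desired speed range, coast above it) can be sketched as a single decision step. This is a hedged illustration, not the paper's Eqs (15) to (18): the function name `expert_control`, the kinematic form of the safe speed, the behavior inside the speed range, and the values of `u_max`, `du` and `b_max` are all assumptions. Only the 0.95 scaling factor comes from the text.

```python
import math


def expert_control(v, u_prev, v_low, v_high, s, s_next, v_lim_cur, v_lim_next,
                   u_max=1.0, du=0.3, b_max=1.0, eta=0.95):
    """One step of a rule-based controller: brake, accelerate, coast, or hold.

    v_low / v_high bound the desired speed range of the current interval.
    eta is the speed scaling factor (0.95 in the text); u_max, du and b_max
    are placeholder values chosen for this sketch.
    """
    # Assumed safe speed: highest speed from which braking at b_max still
    # meets the next limit at s_next, scaled by eta for delay and friction.
    v_brake = eta * math.sqrt(v_lim_next ** 2 + 2.0 * b_max * max(s_next - s, 0.0))
    # Never let the safe speed exceed the current speed limit.
    v_safe = min(v_brake, v_lim_cur)
    if v >= v_safe:
        return -b_max                    # safety braking takes priority
    if v < v_low:
        return min(u_max, u_prev + du)   # ramp up traction, limited for comfort
    if v > v_high:
        return 0.0                       # coast
    return u_prev                        # assumption: hold output inside the range


# Far from the next limit and too slow: traction is ramped up by one step.
u_accel = expert_control(v=5.0, u_prev=0.2, v_low=10.0, v_high=15.0, s=0.0,
                         s_next=1000.0, v_lim_cur=20.0, v_lim_next=10.0)
# Close to a lower limit and too fast: maximum deceleration is applied.
u_brake = expert_control(v=19.0, u_prev=0.0, v_low=10.0, v_high=15.0, s=990.0,
                         s_next=1000.0, v_lim_cur=20.0, v_lim_next=10.0)
```

Because the increment `du` is fixed here, the controller cannot trade comfort and energy against time on its own, which mirrors the limitation stated above.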
3.2 EITOP
RL is a machine learning paradigm that aims to learn to control systems in environments to maximize numerical performance associated with long-term goals [27]. Three reasons motivate us to adapt deep reinforcement learning in train control tasks:
(1) The EITO algorithm does not require reference to the target speed profile, while RL does not require external supervision.
(2) During train control, behavior affects not only the immediate reward but also the reward for future states, which plays to the strengths of RL.
(3) The use of deep reinforcement learning can modify the control strategies used in current ATO systems for discrete actions.
The algorithmic process of EITOP is presented in Algorithm 2.
Markov Decision Process: Before applying the reinforcement learning algorithm, we formulate our problem as a Markov Decision Process (MDP), which provides a mathematical framework for decision making. The key elements of reinforcement learning include its state, action, policy, and reward, which are defined as follows.
State xi. In this case, the train status, with the current position, speed, and reserved trip time, can be described as:
Let x0 denote the initial train state and xm denote the final state. Obviously, the following equations should hold:
Action ai. In contrast to in Eq (15) in EITO_E_, EITO_P_ has a variable variation of acceleration in each state. As shown in Eqs (22)–(23), we use instead of . Meanwhile, we define the range of ui and as [–1,1] and [–0.3,0.3], respectively. Therefore, action ai can be defined for the EITO_P_:
when
when
Policy. In a discrete-action task, the policy gives the probability of taking each action. Since EITO is intended to address continuous-action control tasks, the policy is instead described by the statistics of a probability distribution, expressed as Eq (25):
where is the weight.
Reward function. This function defines the reward that the train receives when it takes an action in a given state. In this case, our reward function is defined by the time error, the passenger comfort, and the energy consumed per unit mass in the time interval during which the train takes action ai in state xi.
where are determined by expert experience, and te is defined as Eq (27):
The role of and is used to ensure that the agent optimizes energy consumption while ensuring punctuality and comfort as much as possible, rather than just reducing energy consumption.
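The trade-off described above can be illustrated with a minimal sketch. It assumes the reward is a negative weighted sum of the three penalties; that functional form, the names `RewardWeights` and `step_reward`, and the default weight values are all illustrative assumptions, not the paper's Eq (26), whose weights are set by expert experience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the paper sets its weights by expert experience.
    time: float = 1.0
    comfort: float = 1.0
    energy: float = 1.0


def step_reward(time_error, comfort_cost, energy_per_mass, w=RewardWeights()):
    """Return a reward that decreases as any of the three penalties grows.

    Assumed form: negative weighted sum of the time error, the comfort cost
    and the energy consumed per unit mass during one control interval.
    """
    penalty = (
        w.time * abs(time_error)
        + w.comfort * abs(comfort_cost)
        + w.energy * energy_per_mass
    )
    return -penalty


if __name__ == "__main__":
    # A larger time error yields a lower reward for the same comfort and energy.
    print(step_reward(1.0, 0.1, 0.5), step_reward(2.0, 0.1, 0.5))
```

With non-zero time and comfort weights, an agent cannot raise its reward by saving energy alone if doing so makes the train late or the ride uncomfortable, which is the role the text assigns to those terms.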
The EITO_P_ algorithm is based on the PPO algorithm [28]. PPO is a deep reinforcement learning algorithm based on policy gradient (PG). Moreover, it is based on the Actor-Critic framework capable of handling continuous action control and model-free problems. The PPO algorithm limits the update magnitude of the new policy according to the ratio of the old to the new policy so that the PG algorithm can be trained and converge at a larger learning rate. The objective function of the policy gradient algorithm is:
where denotes the policy function; is the network parameter of the Actor; i indexes the state or action of the ith step; is the estimate of the advantage function at the ith step, as shown in Eq (29); and E denotes the empirical expectation over time steps. The advantage function compares the score obtained in a given state with the average score: if the obtained score is higher, the advantage is positive; otherwise, it is negative. The gradient ascent method is used to update the value function.
where is the state action-value function, which represents the expected reward of the Agent following the policy , after performing an action ai in state xi until the end of the episode. Similarly, the state value function represents the expected reward of the Agent following the policy from the state xi to the end of the episode.
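For reference, the relationship between the two value functions described here is commonly written in the RL literature as a difference. The notation below is an assumption (the symbols of Eq (29) are not reproduced in this text), with x_i the state and a_i the action at step i:

```latex
\hat{A}_i = Q_{\pi}(x_i, a_i) - V_{\pi}(x_i)
```

A positive value means the chosen action did better than the policy's average behavior from the same state.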
Because the PG algorithm is on-policy and must resample after every parameter update, its learning rate is not easy to determine. The PPO algorithm converts this on-policy update into an off-policy-style update by maintaining an old and a new Actor policy. The training data of the new Actor are obtained from the old Actor, and the new policy is weighted by the ratio of the action probabilities under the new and old policies, which is expressed as Eq (30):
where is the sampled neural network parameter. If the probability distributions obtained for the two neural network parameters and in the same state differ greatly, and sampling is insufficient, the variance between them becomes large. Therefore, the PPO algorithm adds a CLIP function on top of the objective function to limit the parameters and , given as follows:
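As an illustration, the clipped surrogate as it is usually stated in the PPO literature [28] can be sketched in a few lines. This is not the authors' implementation: the function name `clipped_surrogate`, the use of log-probabilities as inputs, and the default `clip_eps=0.2` are assumptions made for the example.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of steps.

    logp_new / logp_old: log-probabilities of the sampled actions under the
    new and old policies. advantages: advantage estimates for the same steps.
    clip_eps: assumed clipping range (0.2 is a common default, not a value
    taken from the paper).
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # new/old action probability ratio
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to push the ratio
        # outside the clipping range.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Identical policies: the objective reduces to the mean advantage.
    print(clipped_surrogate([0.0, 0.0], [0.0, 0.0], [1.0, 3.0]))
```

When the new policy doubles the probability of an action with positive advantage, the clipped term caps its contribution at 1 + `clip_eps` times the advantage, which is how the update magnitude is limited.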
The term “continuous control task” in our work refers to real-time optimization of continuous traction/braking forces without relying on predefined discrete actions or offline speed profiles. Unlike traditional methods that track fixed trajectories, our algorithms dynamically adjust acceleration/braking rates (Eqs (22)–(24)) based on real-time states (position, speed, remaining time) and environmental conditions (speed limits, gradients). This is enabled by:
EITO_E_: Expert rules (Sect 3.1) and the DMTD heuristic (Algorithm 1) generate smooth, continuous force adjustments.
EITO_P_: The PPO-based reinforcement learning framework (Sect 3.2) optimizes continuous actions in a policy gradient manner (Eq (25)), allowing fine-grained control over acceleration/deceleration.
This approach eliminates abrupt state transitions (e.g., discrete coasting points in prior works [10,12]) and ensures seamless adaptation to varying line conditions (Sect 4.3).
While the core control logic is detailed in Algorithms 1 (EITO_E_) and 2 (EITO_P_), we acknowledge that the convergence analysis of PPO training could be elaborated further. For clarity:
Control steps: EITO_P_ iteratively samples actions from a Gaussian policy (Eq (25)) and updates the actor-critic networks using clipped surrogate objectives (Eq (31)). The reward function (Eq (26)) penalizes energy consumption, comfort violations, and time deviations, ensuring balanced optimization.
Convergence: Figs 5 and 6 (training curves) show that energy consumption and running time stabilize after about 80 episodes, indicating policy convergence.
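The sampling step can be sketched as follows. The sketch assumes that the policy outputs a Gaussian over the change in the control command, that this change is limited to [-0.3, 0.3], and that the resulting command is limited to [-1, 1], following the ranges given in the action definition. The names `sample_action` and `clamp` and the way the mean and standard deviation are supplied are hypothetical; in the paper they would come from the Actor network.

```python
import random

U_RANGE = (-1.0, 1.0)    # range of the control command u_i (from the text)
DU_RANGE = (-0.3, 0.3)   # assumed range of the per-step change in u_i


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def sample_action(u_prev, mean_du, std_du, rng):
    """Sample the next control command from a Gaussian over its change.

    mean_du and std_du stand in for the Actor outputs (hypothetical here).
    """
    du = clamp(rng.gauss(mean_du, std_du), *DU_RANGE)
    return clamp(u_prev + du, *U_RANGE)


if __name__ == "__main__":
    rng = random.Random(0)
    u = 0.0
    for _ in range(5):
        u = sample_action(u, mean_du=0.1, std_du=0.2, rng=rng)
        print(round(u, 3))
```

Bounding the per-step change keeps successive commands close to each other, which is one simple way to obtain the smooth, continuous adjustments the text describes.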
Fig 4: Speed limits and gradients from RJ to WYJ.
4 Simulations
To verify the intelligence, flexibility, and robustness of EITO_E_ and EITO_P_, we designed three numerical simulation experiments based on field data collected in YLBS. YLBS started operation in Beijing on December 30, 2010, with a total length of 23.3 km, starting from Songjiazhuang station and ending at Ciqu station. The train type used in YLBS is the DKZ32 EMU with 6 vehicles, whose parameters are shown in Table 1. To rigorously validate the suitability of EITO_P_ for online control, we conducted experiments on a workstation with the following specifications: CPU: Intel i9-10900K (10 cores, 3.7 GHz); GPU: NVIDIA RTX 3090 (24 GB VRAM); Memory: 64 GB DDR4; Software: Python 3.8, TensorFlow 2.6.
Table 1: Parameters of DKZ32.
Three simulation cases are presented in this section. The manual driving dataset we use in this section was collected in YLBS from May 1, 2015, to May 27, 2015, and includes 100 groups of up trains and down trains. We select the manual driving data with the best-generalized performance from the recorded dataset as EITO_M_. In Case 1, we compare the results of all algorithms (EITO_M_, ITOR, STON, EITO_E_ and EITO_P_). In Case 2, we test the intelligence and flexibility of all algorithms by varying the planned trip time of the same rail segment. In Case 3, we test the operational performance of the EITO models with complex gradients and speed limits to verify the robustness of the proposed EITO_E_ and EITO_P_.
4.1 Case 1
Taking the interval between Rongjing station (RJ) and Wanyuanjie station (WYJ) as an example, the speed limit and gradient of this interval are shown in Fig 4. The planned trip time Tp = 101 s is the same as in actual operation, and the distance between the two stations is 1280 m.
Figs 5 and 6 show the energy consumption E and the running time (in s) during the online learning process of EITO_P_. The results show that during DRL, the energy consumption is reduced from 380 to 364 after about 80 rounds of training and gradually approaches the optimal value. In addition, the running time fluctuates between 100 s and 102 s (Tp = 101 s). According to the definition of punctuality, a time error of less than 3 s is allowed, so these running times are punctual. This indicates that applying the PPO algorithm can reduce the energy consumption online while satisfying the running time error constraint.
It can be seen from Fig 7 that EITO_E_ and EITO_P_ start to coast after accelerating to in the first two speed limit intervals, and the coast point of is more advanced. The coasting distances of EITO_E_ and EITO_P_ are 899.78 m and , respectively. Note that in the last speed limit interval, EITO_P_ did not choose to coast at the position where EITO_E_ started to coast. Instead, it decelerates slightly based on its current position, speed, and remaining trip time. As a result, EITO_P_’s Ie and Ic are further reduced compared with EITO_E_, which shows that EITO_P_ can consider the constraints of multiple objectives in a more integrated way.
Figs 7 and 8 show the speed-distance curves of the five algorithms at a 101 s planned trip time. The speed curve of EITO_M_ can be divided into a full acceleration phase, a coasting phase, and a full braking phase. EITO_E_ has the highest maximum speed of 19.49 m/s, and its speed-distance curve has four phases: the acceleration phase, multiple coasting phases, the safety braking phase, and the parking phase. In addition, the coasting distances of EITO_M_, ITOR, and STON are 677.99 m, 398.78 m, and 661.46 m, respectively, which are significantly less than those of EITO_E_ and EITO_P_, indicating that the proposed EITO algorithms have lower energy consumption.
We can see from Table 2 that all five algorithms meet the requirements of YLBS in terms of safety, punctuality, and parking accuracy. Among these algorithms, ITOR has the highest energy consumption. Compared with EITO_M_, ITOR is 1.7% higher, STON is 11.7% lower, and EITO_E_ is 34.9% lower; EITO_P_ further reduces energy consumption by 4.3% relative to EITO_E_. In terms of comfort, EITO_M_ has the highest Ic, indicating the worst passenger comfort, while the remaining algorithms have similar comfort index values, all much lower than that of EITO_M_. EITO_P_ has the best Ie and Ic at the 101 s trip time.
Table 2: Comparison of performance with different trip time.
4.2 Case 2
In this case, we verify the flexibility of the five algorithms by simulating planned trip times of 95 s and 115 s on the same rail section. Since an ATO system generally requires offline recommended speed curves, it is difficult for it to adjust the trip time dynamically. Furthermore, increasing the use of regenerative energy requires real-time rescheduling of the planned trip time for each train on the metro line. If the train model can adjust the arrival time in real time according to such notifications, the regenerative energy can be better utilized to achieve energy-saving operation of the metro [29,30].
Therefore, we similarly carried out two examples of dynamically adjusting the trip time (extending or reducing the trip time) by using EITO_P_ to overcome the above shortcomings. In our simulations, such examples are called EITO with flexible adjustable trip times.
Fig 9 shows the speed-distance curves for the five algorithms at a trip time of 95 s. ITOR has the highest maximum velocity of 22.22 m/s and a shorter coasting distance, indicating that ITOR may have higher energy consumption and worse passenger comfort. STON has the lowest maximum velocity, but it decelerates too early in the second speed limit interval, resulting in a shorter coasting distance and higher energy cost. The speed curves of EITO_E_ and EITO_P_ are similar: both are smoother and have longer coasting distances, indicating that both algorithms may perform better in terms of comfort and energy consumption. In addition, accelerates slightly in the last section where the speed-limited coasts, indicating that can adjust the arrival time of , which further illustrates the effectiveness of .
Fig 5: Energy consumption in the PPO process.
Fig 6: Running time in the PPO process.
Fig 7: Operation curve comparison among the EITO algorithms.
Fig 8: Operation curve comparison among the five algorithms in the 101 s trip.
Fig 9: Operation curve comparison among the five algorithms in the 95 s trip.
It can be seen from Table 2 that, compared with EITO_M_, the energy consumption of ITOR is 5.2% higher and that of STON is 8.8% lower. The EITO algorithms perform better still, both saving more than 45% in energy cost compared with EITO_M_. In addition, in terms of riding comfort, EITO_M_ has the largest Ic, while ITOR, STON, and EITO_E_ have similar Ic values, which are much smaller than that of EITO_M_. EITO_P_ has the best comfort, with an Ic of 2.92.
Fig 10 shows the speed-distance curves for the five algorithms at a trip time of 115 s. The maximum speed of EITO_M_ is 14.73 m/s, and the maximum speeds of ITOR and STON are 14.69 m/s and 14.65 m/s, respectively. The maximum speeds of EITO_E_ and EITO_P_ are 16.15 m/s and 16.10 m/s, respectively. Compared with the speed-distance curves at the 95 s and 101 s planned trip times, the maximum velocities are much lower, indicating lower average velocities and lower energy consumption.
Fig 10: Operation curve comparison among the five algorithms in the 115 s trip.
Furthermore, Table 2 shows that all five algorithms meet the requirements in terms of safety, punctuality, and parking accuracy. Compared with EITO_M_, the energy consumption of ITOR is 0.5% higher and that of EITO_E_ is 4.8% higher, while the energy consumption of STON is 1.5% lower and that of EITO_P_ is 2.7% lower. EITO_E_ therefore has the highest energy consumption. Meanwhile, the comfort index of EITO_E_ is also higher than that of the other three intelligent operation models, owing to its multiple large changes in u during the acceleration phase. In addition, although EITO_P_ arrived 1 s earlier than the expected arrival time, the Ic and Ie of EITO_P_ are better than . This result further illustrates that EITO_P_ can dynamically optimize the train’s operating state while comprehensively considering the constraints of multiple objectives. In this instance, EITO_P_ outperforms the other four algorithms in Ic and Ie.
Fig 11 shows the running curves of EITO_P_ with a dynamically adjusted (earlier or later) arrival time on the RJ to WYJ rail section, where the originally planned trip time is 101 s. Here, 15 s Later is the speed curve for the case in which the train is informed that it should arrive 15 s later, and 10 s Earlier is the speed curve for the case in which the train is informed that it should arrive 10 s earlier. Constant trip is the speed curve when the train runs normally within the 101 s planned trip time.
Fig 11: EITOP with flexible trip time.
In the first example (15 s Later), the train is informed, after running for 30 s, that it should arrive at the next station 15 s later. The results are shown in Fig 11 and Table 3. In Fig 11, as the remaining trip time is extended from 71 s to 86 s, the train immediately stops accelerating and starts coasting. The train then continuously reduces its operating speed by braking. In addition, Table 3 summarizes the detailed performance of EITO_P_ after the trip time is dynamically adjusted during the inter-station run. The final running time of 15 s Later is 114 s, which meets the punctuality index (note that Tp has been changed to 116 s). However, the sudden delay in the arrival time caused the train to decelerate with a larger u at the beginning of the last speed-limited section of the interval. As a result, passengers may feel discomfort because of the increase in the Ic of EITO_P_.
Table 3: EITOP with a variable trip time.
The second example is the converse of the first: the train is informed, after running for 10 s, that it should arrive at the station 10 s earlier. This means that the remaining trip time suddenly drops from 91 s to 81 s. Comparing the 10 s Earlier and Constant trip curves in Fig 11 shows that if the trip time decreases by 10 s due to an accident, EITO_P_ changes its driving strategy and accelerates for the rest of the trip. Moreover, Table 3 shows that the final running time is 88 s (note that Tp has been changed to 91 s), which almost exceeds the punctuality requirement. This implies that, despite the application of PPO and the improvement in the general performance of metro operation, the punctuality of train operation is still affected by sudden changes in arrival times. Overall, EITO_P_ can cope flexibly with variable trip times.
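The arithmetic behind the two examples can be checked with a short sketch. The function names `remaining_after_update` and `time_error` are hypothetical helpers introduced for illustration; the numbers are the ones reported in the text for the 101 s planned trip.

```python
def remaining_after_update(planned_s, elapsed_s, shift_s):
    """Remaining trip time once the train learns of a schedule shift.

    shift_s > 0 means a later arrival, shift_s < 0 an earlier one.
    """
    return planned_s - elapsed_s + shift_s


def time_error(actual_s, planned_s, shift_s):
    """Signed difference between the actual and the revised planned trip time."""
    return actual_s - (planned_s + shift_s)


if __name__ == "__main__":
    # 15 s Later: notified after 30 s, final running time 114 s.
    print(remaining_after_update(101, 30, 15))   # 86
    print(time_error(114, 101, 15))              # -2
    # 10 s Earlier: notified after 10 s, final running time 88 s.
    print(remaining_after_update(101, 10, -10))  # 81
    print(time_error(88, 101, -10))              # -3
```

The second case lands 3 s away from the revised plan, which is why the text describes it as almost exceeding the punctuality requirement.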
Concerning unexpected speed limit changes: During the operation, we will simulate temporary speed restrictions (e.g., due to track maintenance). Preliminary results confirm that EITO_P_ adjusts braking/coasting strategies in real time to comply with new limits while minimizing energy consumption. This new scenario will further validate the algorithm’s effectiveness in handling unforeseen circumstances commonly encountered in real-world train operations, thus enhancing the reliability and practicality of our research findings.
In our study, we have already conducted tests on dynamic trip time adjustments, as presented in Case 2 of Sect 4.2. In these tests, EITO_P_ adapts to sudden changes in the remaining journey time, as evidenced by Fig 11 and Table 3. Specifically, for a sudden time reduction, if the train is notified midway that it should arrive 10 seconds earlier, EITO_P_ responds by dynamically increasing the acceleration. As shown by the “10 s Earlier” curve in Fig 11, this adjustment enables the train to meet the revised schedule. This demonstrates that EITO_P_ reacts quickly in time-constrained scenarios and can optimize the train’s operation to maintain punctuality even under tight time pressure.
Regarding sudden time extension, when the train is informed that it can arrive 15 seconds later, EITO_P_ takes appropriate action. As depicted in the “15 s Later” curve of Fig 11, the algorithm reduces the train’s speed. By doing so, it manages to save energy while still maintaining punctuality. This not only showcases the adaptability of EITO_P_ but also highlights its remarkable capacity to strike a balance between energy consumption and punctuality, two critical factors in train operation.
In conclusion, ITOR, STON, EITO_E_, and EITO_P_ can all generate reasonable control strategies and meet operating requirements for different planned trip times, thus demonstrating their flexibility. In addition, EITO_P_ can dynamically adjust the train operation strategy in real time when informed of a different arrival time (earlier or later), indicating that EITO_P_ also has a degree of intelligence.
4.3 Case 3
Most current studies have tested models within a single interval, and few have tested the robustness of algorithms over continuous line intervals with complex speed limits and gradients. Here, we take the continuous station interval from Songjiazhuang Station (SJZ) to Xiaocun Station (XC) and then from XC to Xiaohongmen Station (XHM) as an example to test the robustness of EITO_E_ and EITO_P_. As can be seen from Figs 4 and 12, the maximum gradient in Fig 12 (Case 3) is 500% of that in Fig 4 (Case 1), while the speed limits of the latter change more dramatically.
Fig 12: Speed limits and gradient from SJZ to XHM.
The length from SJZ to XC is 2,631 m with a planned trip time of 190 s, and the length from XC to XHM is 1,274 m with a planned trip time of 108 s, i.e., the total trip time and total length are 298 s and 3,905 m. In this case, we ignore the stopping time, i.e., the train starts after arriving at the station and drives to the next station immediately.
Fig 13 shows the speed-distance curves of the algorithms running from SJZ to XHM. The trajectories of the two curves are similar, and the maximum speed of EITO_P_ is slightly lower than that of EITO_E_, which indicates that EITO_P_ may consume less energy than EITO_E_. The average inference time for EITO_P_ to generate a control action (acceleration/deceleration) at each time step (0.02 s) is 2.1 ms, which is roughly 10 times faster than the required control interval. This ensures real-time applicability even under strict operational deadlines.
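The real-time margin quoted above follows from dividing the control interval by the inference time. The helper name `headroom_ratio` is hypothetical; the inputs are the 0.02 s time step and the 2.1 ms average inference time reported in the text.

```python
def headroom_ratio(control_interval_s, inference_time_ms):
    """How many inference calls would fit into one control interval."""
    return (control_interval_s * 1000.0) / inference_time_ms


if __name__ == "__main__":
    ratio = headroom_ratio(0.02, 2.1)
    # 20 ms / 2.1 ms is a little under 10, so about 90% of each
    # control interval remains free after the action is computed.
    print(round(ratio, 2))
```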
Fig 13: Operation curve comparison among the EITO algorithms.
In addition, when the train accelerates to the coasting point in the [1161 m, 2501 m] interval and starts coasting, the train gains positive acceleration, because the interval contains a downhill section with a gradient of -0.008. This would be dangerous if the train accelerated beyond the speed limit. However, under the supervision of the safe speed defined in Sect 3.1, the train immediately applies the maximum deceleration when it is about to exceed the safe speed. This situation verifies that the proposed EITO algorithms can ensure the safety of train trips even with complicated speed limits and gradients.
The performance of the EITO algorithms on complex continuous lines is shown in Table 4. The Ic of both algorithms is larger because the line conditions are longer and more complex, and because the train has to stop and restart at XC. However, both algorithms ensure that the train arrives at its destination safely and on time. The arrival times of EITO_E_ at XC and XHM are 188 s and 112 s, respectively, while the arrival times of EITO_P_ at XC and XHM are 189 s and 109 s, respectively, and the total travel times of the two algorithms are 300 s and 298 s, respectively. It can be seen that EITO_P_ outperforms EITO_E_ in terms of both inter-station and total travel on-time performance. Furthermore, EITO_P_’s Ic and Ie are also lower than those of EITO_E_. This indicates that applying the PPO algorithm on the basis of EITO_E_ can effectively improve the punctuality, energy-saving, and comfort indicators of EITO_E_.
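Why coasting on this downhill section speeds the train up, and how a safe-speed override reacts, can be sketched as follows. The sketch ignores running resistance, treats the gradient as a small slope ratio, and uses a normalized command in [-1, 1] with -1 as maximum braking; these simplifications and the names `grade_acceleration` and `supervise` are assumptions for illustration, not the rules of Sect 3.1.

```python
G = 9.81  # gravitational acceleration in m/s^2


def grade_acceleration(gradient):
    """Acceleration due to gravity along the track for a small gradient.

    A negative gradient (downhill) yields a positive acceleration.
    Running resistance is ignored in this sketch.
    """
    return -G * gradient


def supervise(u_requested, speed, safe_speed, u_max_brake=-1.0):
    """Override the requested command with maximum braking at the safe speed."""
    if speed >= safe_speed:
        return u_max_brake
    return u_requested


if __name__ == "__main__":
    # Coasting (u = 0) on the -0.008 gradient still accelerates the train.
    print(round(grade_acceleration(-0.008), 5))
    # Below the safe speed the request passes through; at it, braking wins.
    print(supervise(0.0, 18.0, 20.0), supervise(0.0, 20.0, 20.0))
```

Under these assumptions the downhill contributes about 0.078 m/s² while coasting, so without the override the speed would keep rising until resistance balanced it.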
Table 4: Comparison of performance with complex continuous lines.
The above analysis shows that the energy-saving and comfort performance of both EITO_E_ and EITO_P_ degrades moderately under the complex line conditions. However, both algorithms ensure that the train operates safely and punctually. This indicates that the proposed algorithms have good robustness.
Dynamic Time Adjustment: Case 2 (Sect 4.2) demonstrates EITO_P_’s ability to adapt to sudden trip time changes (e.g., arriving 10 s earlier or 15 s later). The reward function (Eq (26)) penalizes time deviations, incentivizing the agent to adjust acceleration/coasting phases dynamically (Fig 11).
For disturbance suppression (e.g., resistance uncertainty), EITO_P_’s model-free PPO framework inherently adapts to unmodeled dynamics. We will include a dedicated robustness test (e.g., sudden resistance changes) in future work.
The current experiments (Cases 1–3) validate EITO_P_’s adaptability to:
Variable trip times (95 s, 101 s, 115 s).
Complex gradients and speed limits (Fig 12).
Mid-journey schedule updates (Fig 11).
These scenarios inherently cover “unpredictable conditions” by testing the algorithm’s ability to replan trajectories in real time without prior offline profiles.
5 Conclusion
In this study, two EITO algorithms for intelligent train operation are proposed for addressing continuous metro operation control tasks, showcasing their ability to operate without the need for tracking offline speed profiles or relying on exact train model information. Our approach leverages an expert system to generate EITO_E_ outputs based on driver experience and redefines key elements of the Proximal Policy Optimization (PPO) algorithm to develop EITO_P_, which optimizes multiple operational objectives online. Through comparative analysis with existing intelligent driving algorithms and manual driving data, we demonstrated the superiority of our proposed algorithms in terms of safety, punctuality, energy efficiency, and passenger comfort. From the research and results of this work, several key conclusions can be drawn:
Flexible Intelligent Control: The EITO algorithms demonstrate the viability of intelligent train operations adaptable to real-time conditions, enhancing efficiency and adaptability beyond traditional, preset profiles.
Expert System Collaboration: Integrating expert system insights with EITO_E_ significantly bolsters performance, underscoring the collaboration between human expertise and machine learning in intelligent train operations.
Multi-Objective Optimization: EITO_P_ utilizes PPO and stands out as it comprehensively manages safety, punctuality, energy efficiency, and comfort, which are often conflicting objectives, through an overall control strategy.
Robustness in Complexity: Both EITO_E_ and EITO_P_ showcase robust performance in complex operational environments, with EITO_P_ particularly adept at adjusting to varying trip times, crucial for real-world operational efficiency.
Energy and Comfort Superiority: Our algorithms surpass current methods in energy conservation and passenger comfort, tackling critical urban rail transit challenges and validating their practical application.
While our algorithms are promising, future work will focus on enhancing EITO_P_’s dynamic adjustment capabilities and exploring cooperative control strategies for energy savings, including optimizing train schedules for regenerative energy utilization.
References
- 1. Yin J, Tang T, Yang L, Xun J, Huang Y, Gao Z. Research and development of automatic train operation for railway transportation systems: A survey. Transport Res Part C: Emerg Technol. 2017;85:548–72. doi: 10.1016/j.trc.2017.09.009
- 2. Tu Y, Lin S, Qiao J. Deep traffic congestion prediction model based on road segment grouping. Appl Intell. 2021;51:8519–41.
- 3. Yu L, Cui M, Dai S. Deviation of peak hours for metro stations based on least square support vector machine. PLoS One. 2023;18(9):e0291497. doi: 10.1371/journal.pone.0291497. PMID: 37703275; PMCID: PMC10499220
- 4. Khmelnitsky E. On an optimal control problem of train operation. IEEE Trans Automat Contr. 2000;45(7):1257–66. doi: 10.1109/9.867018
- 5. Yang X, Ning B, Li X, Tang T. A two-objective timetable optimization model in subway systems. IEEE Trans Intell Transport Syst. 2014;15(5):1913–21. doi: 10.1109/tits.2014.2303146
- 6. Wang Y, Ning B, Tang T, van den Boom TJJ, De Schutter B. Efficient real-time train scheduling for urban rail transit systems using iterative convex programming. IEEE Trans Intell Transport Syst. 2015;16(6):3337–52. doi: 10.1109/tits.2015.2445920
- 7. ShangGuan W, Yan X-H, Cai B-G, Wang J. Multiobjective optimization for train speed trajectory in CTCS high-speed railway with hybrid evolutionary algorithm. IEEE Trans Intell Transport Syst. 2015;16(4):2215–25. doi: 10.1109/tits.2015.2402160
- 8. Açıkbaş S, Söylemez M. Coasting point optimisation for mass rail transit lines using artificial neural networks and genetic algorithms. IET Electr Power Appl. 2008;2(3):172–82.
