Single Agent Robust Deep Reinforcement Learning for Bus Fleet Control
Yifan Zhang

TL;DR
This paper introduces a novel single-agent deep reinforcement learning framework for bus holding control in complex, realistic urban transit scenarios, outperforming traditional multi-agent methods in stability and effectiveness.
Contribution
It reformulates multi-agent bus control as a single-agent problem using high-dimensional state encoding and schedule-aware rewards, improving robustness and scalability.
Findings
Single-agent RL outperforms MARL in stability and performance.
High-dimensional state encoding captures inter-bus dependencies effectively.
Structured rewards improve schedule adherence and headway uniformity.
Abstract
Bus bunching remains a challenge for urban transit due to stochastic traffic and passenger demand. Traditional solutions rely on multi-agent reinforcement learning (MARL) in loop-line settings, which overlook realistic operations characterized by heterogeneous routes, timetables, fluctuating demand, and varying fleet sizes. We propose a novel single-agent reinforcement learning (RL) framework for bus holding control that avoids the data imbalance and convergence issues of MARL under near-realistic simulation. A bidirectional timetabled network with dynamic passenger demand is constructed. The key innovation is reformulating the multi-agent problem into a single-agent one by augmenting the state space with categorical identifiers (vehicle ID, station ID, time period) in addition to numerical features (headway, occupancy, velocity). This high-dimensional encoding enables single-agent…
| Algorithm | Embedding | |||
|---|---|---|---|---|
| SAC | Full | -369.1K 9.6K | -347.4K 13.2K | -322.8K 9.1K |
| SAC | One-hot | -362.3K 9.7K | -341.5K 10.5K | -350.9K 8.5K |
| SAC | None | -343.5K 12.7K | -347.5K 13.4K | -366.2K 10.7K |
| DDPG | Full | -714.5K 29.6K | -698.7K 42.1K | -712.6K 21.2K |
| DDPG | One-hot | -708.1K 31.9K | -698.6K 32.7K | -690.9K 31.8K |
| DDPG | None | -698.9K 30.7K | -721.2K 19.1K | -698.3K 30.0K |
| TD3 | Full | -714.0K 33.6K | -492.1K 17.9K | -701.6K 28.6K |
| TD3 | One-hot | -706.9K 24.2K | -703.9K 32.5K | -709.6K 21.3K |
| TD3 | None | -694.8K 31.3K | -690.7K 27.6K | -694.2K 26.7K |

| Algorithm | Embedding | |||
|---|---|---|---|---|
| SAC | Full | -496.3K 13.7K | -479.4K 16.5K | -436.7K 8.8K |
| SAC | One-hot | -474.2K 10.9K | -475.8K 14.4K | -486.0K 13.6K |
| SAC | None | -478.0K 11.7K | -483.5K 17.3K | -517.2K 11.5K |
| DDPG | Full | -992.2K 41.5K | -981.5K 42.7K | -989.9K 46.1K |
| DDPG | One-hot | -995.2K 45.2K | -952.5K 38.3K | -992.6K 27.3K |
| DDPG | None | -996.3K 31.6K | -1003.2K 42.7K | -990.6K 34.4K |
| TD3 | Full | -994.4K 43.1K | -634.7K 15.4K | -983.0K 44.1K |
| TD3 | One-hot | -959.6K 49.7K | -979.9K 32.3K | -979.8K 38.5K |
| TD3 | None | -989.5K 49.9K | -991.9K 64.9K | -982.8K 30.4K |

| Algorithm | Embedding | |||
|---|---|---|---|---|
| SAC | Full | -622.7K 12.5K | -608.4K 21.1K | -554.2K 12.5K |
| SAC | One-hot | -585.7K 19.2K | -594.0K 17.1K | -617.2K 19.6K |
| SAC | None | -602.5K 15.7K | -611.3K 15.4K | -651.6K 17.9K |
| DDPG | Full | -1251.4K 57.0K | -1255.4K 53.1K | -1257.9K 47.5K |
| DDPG | One-hot | -1259.6K 55.7K | -1239.6K 54.7K | -1227.3K 54.0K |
| DDPG | None | -1253.1K 57.2K | -1260.0K 62.1K | -1243.9K 49.7K |
| TD3 | Full | -1245.4K 56.2K | -780.2K 16.8K | -1268.2K 45.9K |
| TD3 | One-hot | -1243.1K 49.0K | -1247.6K 51.4K | -1253.5K 29.4K |
| TD3 | None | -1227.5K 45.7K | -1226.1K 46.8K | -1219.3K 31.5K |
Single agent robust deep reinforcement learning for bus fleet control
Yifan Zhang
Liang Zheng Corresponding author. Central South University, [email protected]
Abstract
Bus bunching remains a critical challenge in urban transit systems, primarily driven by the stochastic nature of traffic conditions and passenger demand. A recently popular approach to this issue is multi-agent reinforcement learning (MARL) applied in an idealized loop-line environment. However, such methods generally suffer from high computational costs and sample inefficiency. Moreover, they often fail to capture the dynamics of realistic bus systems, which are typically governed by heterogeneous trip lines and variable fleet sizes. In this study, we propose a robust single-agent reinforcement learning (RL) framework for bus holding control in a bidirectional timetabled bus line, explicitly designed to circumvent the data imbalance and convergence issues associated with MARL. Our key contribution is transforming the inherently multi-agent problem into a single-agent formulation by explicitly encoding categorical identifiers (such as vehicle, station, and trip IDs) alongside traditional numerical features (e.g., headway, occupancy, and segment velocity) in the state representation. This feature space augmentation enables the single agent to operate effectively in a higher-dimensional space, analogous to projecting linearly inseparable inputs into a higher-dimensional space to achieve separability. Additionally, we introduce a structured "ridge-shaped" reward function that incentivizes alignment with both uniform headways and scheduled departure intervals. Compared to the other three benchmark methods, our proposed RL algorithm achieves significantly more stable and higher rewards (k comparing with k) under stochastic passenger demands and inter-station travel time. These experimental results suggest that the proposed single-agent RL approach, informed by categorical identifiers in the state representation and a realistic schedule-aware design in the ridge-shaped reward function, can effectively mitigate bus bunching in non-loop settings.
This paradigm offers a robust and scalable alternative to those conventional MARL-based control frameworks, particularly in environments where agent-specific experience distributions are inherently imbalanced.
Keywords: bus bunching; reinforcement learning; soft actor-critic; bus holding control.
1 Introduction
Bus bunching remains a pervasive challenge in contemporary urban public transportation systems. This phenomenon occurs when two or more buses operating on the same route cluster together, thereby compromising service reliability and prolonging passenger waiting times. While often attributed simply to schedule deviations, empirical observations indicate a more intricate mechanism, driven by the dynamic interplay between traffic conditions and passenger demand [2]. Notably, bunching tends to emerge not necessarily during peak rush hours, but often in the transitional intervals preceding them. During these periods, a sudden surge in passenger demand or a reduction in average road speed can significantly decelerate a leading bus, due to extended dwell times or congestion [14]. Concurrently, the trailing bus, encountering fewer passengers or reduced congestion, begins to close the gap. If the trailing bus eventually overtakes the leader, it then inherits the heightened demand at subsequent stops, decelerating in turn. This self-reinforcing cycle where the new leader is persistently burdened by increased demand results in buses alternately decelerating and accelerating relative to one another, eventually converging into a stable cluster [23]. Over time, such unstable headway dynamics deteriorate the overall temporal regularity of the service. This phenomenon underscores the critical importance of developing control strategies that can adapt in real-time not only to current headway but also to demand and speed conditions that exhibit high volatility. Traditional methods, as well as conventional reinforcement learning (RL)-based approaches, often fail to accommodate the asynchronous, event-driven nature of these interactions.
Recent advancements in bus fleet coordination have increasingly leveraged multi-agent reinforcement learning (MARL) frameworks, enabling individual vehicles to develop decentralized policies based on local observations. While MARL has demonstrated success in idealized environments such as loop-line or circular routes where the fleet size and trip distribution remain static, its applicability to realistic, bidirectional, and timetable-governed networks is hindered by several fundamental limitations.
In principle, MARL-based approaches are highly susceptible to severe data imbalance across agents. This is particularly evident in real-world operations where buses deployed exclusively during peak hours (e.g., the 13th bus in a morning rush) accumulate substantially fewer experiences than those operating continuously, leading to unstable or suboptimal policy convergence. Furthermore, unlike continuous loop-line settings, bidirectional lines are inherently finite and bounded by terminal stops. This structural difference prevents agents from accumulating rewards over multiple loops, thereby impeding effective credit assignment and the propagation of long-term rewards across truncated episodes. This challenge is further exacerbated by the asynchronous, event-driven nature of transit systems. Unlike traditional MARL benchmarks such as robotic soccer, which rely on synchronous decision-making, bus operations involve agents making decisions at irregular, temporally misaligned intervals [22, 21].
Our research addresses these gaps by introducing a robust single-agent deep reinforcement learning framework explicitly tailored to the complexities of realistic, timetable-based operations. By reformulating the inherently multi-agent task into a single-agent problem through the use of categorical identifiers (such as vehicle and station IDs), we circumvent the data imbalance and convergence issues typical of MARL. This approach enables a unified policy to generalize across a heterogeneous fleet, effectively learning a robust control law that accounts for the asynchronous interactions and finite trip structures inherent in bidirectional transit corridors.
In pursuit of this objective, this paper presents four key contributions. First, we have developed a realistic bidirectional bus simulation environment designed to bridge the gap between theoretical RL research and practical transit deployment. Unlike the commonly employed loop-line settings in previous studies, which assume fixed fleet sizes and uniform trip distribution among vehicles across the fleet, our environment operates under real-world constraints such as time-varying demand, asymmetric direction stops, stochastic traffic conditions, and dynamic fleet activation based on timetable triggers. Second, we propose a novel single-agent RL framework that reformulates the multi-agent bus control problem as a single-agent task. By explicitly embedding categorical features (e.g., vehicle ID, stop ID, and trip ID) and concatenating them with continuous features such as headway and segment speed, our model enables the agent to generalize across heterogeneous buses and temporal contexts. This design leverages the intuition behind high-dimensional feature mappings analogous to kernel methods in machine learning where linearly inseparable patterns become linearly separable once augmented with sufficient structural feature dimensions. Third, we design a structured reward function that replaces common exponential-based heuristics with a “ridge-shaped” topology, which simultaneously incentivizes uniform headways and adherence to the scheduled inter-vehicle departure intervals. This design draws upon principles from transit service planning literature [5], emphasizing that the schedule itself is the core determinant of service quality. Fourth, we empirically demonstrate that soft actor critic (SAC), even in its standard form, provides inherent robustness in stochastic and dynamically disturbed environments, aligning with recent theoretical findings that connect maximum entropy RL with robustness guarantees [8]. 
Extensive experiments confirm that our SAC-based approach outperforms the MARL baseline multi-agent deep deterministic policy gradient (MADDPG) by achieving reduced variance and superior reward accumulation, particularly under peak-period stress scenarios.
The structure of this paper is organized as follows: Section 2 provides a review of related literature on bus bunching issues and reinforcement learning methodologies within public transportation systems, with a particular focus on bus operations. Section 3 outlines the problem formulation and simulation environment, incorporating the bus operation model and stochastic passenger demand. Section 4 details the proposed method, specifying the reinforcement learning framework and reward formulation. Section 5 presents the experimental results and performance comparisons. Finally, Section 6 concludes the paper and discusses directions for future research.
2 Related work
2.1 Bus bunching mitigation strategies
Bus bunching mitigation strategies have traditionally been categorized into station-based and inter-station-based control approaches [5]. Station holding aims to regulate headways by delaying early-arriving buses [4], while inter-station control adjusts cruise speeds or leverages traffic signal priority to maintain spacing [1]. Recent studies have explored integrated, hybrid strategies that combine both control dimensions. For instance, [13] proposed a multi-strategy system leveraging deep reinforcement learning to unify stop-level holding, speed guidance, and signal adaptation. While such integration improves adaptability under complex conditions, it often introduces training instability, especially in high-dimensional or temporally asynchronous environments. Furthermore, a significant portion of the existing literature assumes closed-loop or loop-line configurations [6, 3], which oversimplify realistic bidirectional and timetable-governed operations.
Our work addresses these limitations by developing a learning architecture compatible with flexible scheduling and dynamic fleet sizes. Traditional bus control strategies often fail to fully account for the inherent uncertainties in urban transit systems, such as fluctuating passenger demand and stochastic travel times. To address this, recent research has shifted toward robust decision-making frameworks. For instance, [27] developed a robust nonlinear decision mapping approach for online bus speed control under uncertainty, which was further extended by [11] into a bi-objective optimization framework that accounts for both implementation errors and traffic volatility. While these approaches demonstrate the efficacy of robust mapping in speed regulation, they often rely on static control laws. In contrast, our work leverages the adaptive nature of SAC to learn a robust station-based holding policy, explicitly tailored to handle the complex dynamics of bidirectional, timetabled networks under extreme stochastic conditions.
2.2 Reinforcement learning approaches in transit systems
Reinforcement learning, particularly MARL, has emerged as a prominent tool for optimizing transit operations under stochastic passenger demand and traffic fluctuations. Notably, [22] and [12] introduced asynchronous MARL frameworks designed to handle event-driven control through macro-actions, thereby significantly reducing policy horizon complexity. These efforts have been extended in hierarchical MARL studies such as [24], where high-level agents coordinate holding and acceleration actions within a decentralized paradigm.
Despite these advancements, MARL approaches often encounter scalability and data imbalance issues, especially when vehicles operate sporadically or are governed by temporally sparse policies. This phenomenon, frequently observed during peak-only deployments or irregular schedules, leads to convergence instability and sample inefficiency [22]. While hierarchical and curriculum-based RL frameworks [20, 24] have been proposed to address these issues, they often require extensive domain-specific design and expert priors.
In contrast, our work leverages a single-agent RL architecture with explicit encoding of categorical identifiers (e.g., time period, direction), enabling uniform policy learning across all vehicles regardless of deployment frequency. This transformation effectively mitigates data imbalance and improves generalization without necessitating multiple cooperating agents.
2.3 Single-agent soft actor-critic and robustness guarantees
SAC has emerged as a state-of-the-art off-policy RL algorithm known for its entropy-maximizing objective, which balances exploration and exploitation [9]. In environments subject to noise and temporal disturbances, SAC has demonstrated superior sample efficiency and stability compared to deterministic policy gradients or Q-learning variants [8].
Recent theoretical results [8] have established that SAC approximately solves a robust reinforcement learning objective of the form:
[TABLE]
where the inner minimization spans a bounded uncertainty set over reward functions and transition kernels . This equivalence indicates that entropy-regularized RL policies inherently exhibit robustness to bounded adversarial perturbations in both reward and dynamics.
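The displayed objective did not survive extraction at this point. As a hedged reading based only on the surrounding sentence, a robust objective of the described shape, together with the standard maximum-entropy objective that SAC optimizes, could be written as follows. All symbols here ($\tilde{r}$, $\tilde{p}$, $\tilde{\mathcal{R}}$, $\tilde{\mathcal{P}}$, $\alpha$) are assumed notation, not necessarily the authors'.

```latex
% Robust RL objective: worst case over a bounded uncertainty set
% of reward functions and transition kernels (assumed notation).
\max_{\pi} \; \min_{\tilde{r} \in \tilde{\mathcal{R}},\; \tilde{p} \in \tilde{\mathcal{P}}}
  \mathbb{E}_{\tilde{p},\, \pi} \left[ \sum_{t} \tilde{r}(s_t, a_t) \right]

% Maximum-entropy objective with temperature \alpha.
J(\pi) = \mathbb{E}_{\pi} \left[ \sum_{t} r(s_t, a_t)
  + \alpha \, \mathcal{H}\big( \pi(\cdot \mid s_t) \big) \right]
```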
Our approach builds on SAC, but extends it by embedding categorical features for developing policies in a highly stochastic bidirectional bus environment. By introducing time-varying origin-destination (OD) flows and Gaussian-distributed travel time disturbances, we aim to enhance the robustness of the learned policy. Combined with a structured, ridge-shaped reward function that emphasizes headway regularity and schedule adherence, our method not only achieves robust convergence but also aligns operational decisions with service quality objectives.
2.4 (Robust) optimization-based approaches
Recent simulation-optimization studies have extensively investigated bus scheduling and operational strategies, extending to electric bus fleet management and rescheduling. For instance, [19] optimized single-line electric bus fleet services using skip-stop actions, resolving the trade-off between efficiency and passenger demand. Furthermore, [18] explored hybrid transit services with modular autonomous vehicles, introducing a flexible and demand-responsive service concept. The robustness of schedule performance was also examined through vehicle-type selection and departure-time shifting in electric bus routes [16]. At the fleet level, [17] investigated replacement strategies for different kinds of electric buses, focusing on the heterogeneity in vehicle characteristics. Additionally, multi-line methodologies have been adapted for single-line operations to enhance the stability of fleet headway and passenger service quality [15]. While these contributions provide valuable optimization frameworks, they largely rely on deterministic formulations, which may face challenges in transit environments characterized by high uncertainty, such as fluctuating demand and travel time variability. To bridge this gap, we propose a reinforcement learning-based approach capable of robustly adapting to dynamic demand and uncertain travel conditions.
Robust simulation-based optimization has also been explored in multiobjective settings. For example, [25] proposed robust approaches for constrained multiobjective problems, while [26] investigated biobjective robust optimization in unconstrained settings. These works highlight the critical relevance of robust optimization techniques for complex stochastic environments.
3 Problem formulation and simulation environment
3.1 Bus operation mode
This study considers a general bidirectional scheduled bus corridor system. The route comprises a set of stations , including two terminal stations and intermediate stops. The system operates within a fixed service window .
The vehicle dispatching process adheres to a predefined timetable spanning the service window . Buses are dispatched at a constant scheduled headway in both directions. Service from the upstream terminal commences at , while the downstream terminal starts with an offset to stagger departures. This offset ensures an initial separation between the two directions, facilitating the analysis of headway evolution by minimizing inter-directional interference.
At each scheduled dispatch time:
- If an idle vehicle is available at the terminal, it is assigned to the trip.
- If no vehicle is available, a new bus is activated.
Each vehicle executes a trip from one terminal to the other. Upon completing a trip, the vehicle becomes idle and available for subsequent assignments. This mechanism inherently models dynamic fleet sizing and resource constraints.
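The dispatch rule above can be sketched in a few lines. This is a minimal illustration, not the authors' simulator: the `Terminal` class, the `dispatch` and `finish_trip` helpers, and the convention that a newly activated bus receives the next unused integer ID are all assumptions made for the example.

```python
from collections import deque


class Terminal:
    """Holds the idle buses currently waiting at one terminal."""

    def __init__(self):
        self.idle = deque()


def dispatch(terminal, fleet_size):
    """Serve one scheduled departure; return (bus_id, updated fleet size)."""
    if terminal.idle:
        # Reuse the bus that has been idle the longest.
        return terminal.idle.popleft(), fleet_size
    # No idle bus: activate a new one (assumed to take the next unused id).
    return fleet_size, fleet_size + 1


def finish_trip(terminal, bus_id):
    """A bus that completes its trip becomes idle at the arrival terminal."""
    terminal.idle.append(bus_id)
```

Under this sketch the fleet grows only when a scheduled departure finds the terminal empty, which is how a timetable alone can determine the number of active vehicles.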
3.2 Passenger demand and traffic dynamics
Passenger demand is characterized by Time-Dependent Origin-Destination (OD) flows. Let denote the arrival rate of passengers traveling from stop to stop during time period . Passenger arrivals are assumed to follow a Poisson process driven by these time-varying rates.
Road traffic conditions are modeled as stochastic variables. The travel time or speed between adjacent stops and follows a distribution (e.g., Gaussian) defined by a time-varying mean and variance . This stochasticity captures real-world disturbances such as congestion and traffic signal delays.
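A rough sketch of how these two stochastic inputs could be sampled is given below. It uses only the Python standard library; the rates, the 360 s interval, the link-time parameters, and the lower clip on travel time are hypothetical values chosen for illustration, not settings taken from the paper.

```python
import random


def sample_arrivals(rate_per_s, interval_s, rng):
    """Count Poisson-process arrivals in an interval via exponential gaps."""
    if rate_per_s <= 0:
        return 0
    count, t = 0, rng.expovariate(rate_per_s)
    while t <= interval_s:
        count += 1
        t += rng.expovariate(rate_per_s)
    return count


def sample_travel_time(mean_s, std_s, rng, floor_s=1.0):
    """Gaussian link travel time; the lower clip at floor_s is an assumption."""
    return max(floor_s, rng.gauss(mean_s, std_s))


rng = random.Random(0)
# Hypothetical OD rates (passengers per second) from stop 0 in one time period.
od_rate = {(0, 3): 0.01, (0, 5): 0.02}
waiting = {od: sample_arrivals(r, 360.0, rng) for od, r in od_rate.items()}
link_time = sample_travel_time(120.0, 20.0, rng)
```

In a time-dependent setting, `od_rate` and the mean and standard deviation of the link time would be looked up per time period before sampling.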
3.3 Illustrative example of bunching dynamics
To illustrate the instability of open-loop bus operations, consider two consecutive vehicles, Bus and Bus , scheduled with headway . The Bus Bunching phenomenon emerges from a positive feedback loop:
- Initial Perturbation: Suppose Bus is slightly delayed by a stochastic disturbance (e.g., traffic signal).
- Passenger Accumulation: The arrival interval at the next station extends (), causing excess passengers to accumulate beyond expectation.
- Dwell Time Expansion: Increased boarding demand requires prolonged dwell time, further delaying Bus (the “slow get slower” effect).
- Compression of Following Headway: Conversely, Bus arrives closer to Bus (), encounters fewer waiting passengers, and dwells for a shorter time (the “fast get faster” effect).
Consequently, the gap diminishes continuously until the two buses clump together. This inherent instability necessitates active control strategies to break the loop.
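The feedback loop can be made concrete with a deliberately simplified linear model. The sketch below is an assumption-laden illustration rather than the paper's simulator: dwell time is taken to be proportional to the headway in front of a bus, with `beta` standing for arrival rate times boarding time per passenger, the bus ahead of the delayed one is assumed to stay exactly on schedule, and the values of `H`, `delta`, and `beta` are hypothetical.

```python
def simulate_gap(H=360.0, delta=30.0, beta=0.1, n_stops=20):
    """Return the headway between the delayed bus A and its follower B per stop.

    h_a: headway of A behind an on-schedule leader (grows once A is late).
    h_b: headway of B behind A (shrinks as A slows down).
    """
    h_a, h_b = H + delta, H - delta  # A starts delta seconds late
    gaps = [h_b]
    for _ in range(n_stops):
        # A dwells longer than nominal in proportion to its excess headway;
        # B dwells for less time than A because fewer passengers are waiting.
        h_a, h_b = h_a + beta * (h_a - H), h_b + beta * (h_b - h_a)
        h_b = max(h_b, 0.0)  # a gap of zero means the two buses have bunched
        gaps.append(h_b)
    return gaps


gaps = simulate_gap()
```

With these illustrative numbers the gap shrinks at every stop and collapses to zero within the 20 simulated stops, even though the initial perturbation is only 30 s.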
3.4 Decision timing and state transition
Unlike traditional methods that trigger decisions immediately upon bus arrival at a stop (denoted in Fig. 1) [6, 27], our model strategically postpones decision-making to a post-service epoch after all passenger boarding and alighting activities have completed. This moment, denoted as in Fig. 1, aligns more robustly with operational control opportunities in practice.
At , the agent receives a state observation and generates a control action. The environment then executes the action (e.g., holding), and upon reaching the subsequent stop at , the agent receives a reward based on the resulting system state (e.g., headway smoothness). This enables the RL agent to respond to the most current post-service information, leading to more stable and informed control decisions.
Figure 1 illustrates the holding decision process and headway computation using three sample buses traveling along a corridor. Each polyline represents the trajectory of an individual bus, where time progresses along the horizontal axis and station index increases along the vertical axis. The darkest bus symbol on each line denotes the most recently visited stop, while progressively lighter segments indicate historical stops previously visited. Holding durations are overlaid along each segment: red bars indicate the current holding time decision, and orange bars denote previous holding actions.
The dashed red rectangle in the figure highlights a critical region used to compute both forward and backward headways. Specifically, the horizontal time difference between and quantifies the forward headway of bus and simultaneously the backward headway of bus . This quantity becomes observable only when both buses have completed dwell operations at the same station, ensuring that the headway reflects the realized temporal interval between two consecutive active vehicles.
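As a small sketch of this bookkeeping, the helper below computes both headways for one bus from the post-service times recorded at a single station. The function name and the dictionary layout are hypothetical; a headway is returned as `None` while the neighbouring bus has not yet completed service at that station, mirroring the observability condition described above.

```python
def headways_at_station(ready_times, bus_id):
    """ready_times: {bus_id: post-service time} at one station, one direction.

    Returns (forward, backward) headways in the same time unit as the input.
    """
    order = sorted(ready_times, key=ready_times.get)  # earliest bus first
    i = order.index(bus_id)
    fwd = ready_times[bus_id] - ready_times[order[i - 1]] if i > 0 else None
    bwd = (ready_times[order[i + 1]] - ready_times[bus_id]
           if i + 1 < len(order) else None)
    return fwd, bwd
```

For example, with post-service times of 100 s, 430 s, and 820 s for three consecutive buses, the middle bus has a forward headway of 330 s and a backward headway of 390 s.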
4 Methodology
4.1 Overall methodology framework
Our control framework transforms the multi-agent bus holding problem into a single-agent reinforcement learning task. To address the heterogeneity of diverse vehicles and stops within a unified policy, we employ categorical embedding layers to map discrete identifiers (e.g., bus_id, stop_id) into dense latent representations. This allows a single SAC agent to control the entire fleet by generalizing across varied spatio-temporal contexts.
The core control logic is built upon an event-driven variant of the soft actor-critic (SAC) algorithm. Unlike standard step-synchronous RL where decisions occur at fixed time intervals, our agent’s actions and rewards are triggered asynchronously by bus-arrival events at stations.
As illustrated in Algorithm 1, we maintain two dictionaries indexed by bus_id: action_dict (to store the state-action pair at the moment of decision) and state_dict (to track the latest observable state). A complete transition tuple is only committed to the replay buffer when the same vehicle reaches its subsequent stop, ensuring the reward reflects the realized headway regularity.
Remarks. (i) The asynchronous tuple assembly prevents temporal leakage by pairing actions with their actual realized outcomes. (ii) The dictionaries function as a per-bus latch, accommodating a varying number of active vehicles without requiring a multi-agent formulation. (iii) Parameters such as and are environment-specific and are detailed in Section 5.
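The per-bus latch can be sketched as follows. This is a simplified illustration and not a reproduction of Algorithm 1: it folds the two dictionaries into a single `action_dict`, uses a plain list as a stand-in for the SAC replay buffer, and the handler name `on_service_complete` and its `policy` callable are hypothetical.

```python
action_dict = {}    # bus_id -> (state, action) latched at the last decision
replay_buffer = []  # stand-in for the global replay buffer


def on_service_complete(bus_id, state, reward, policy, done=False):
    """Handle the event 'bus_id finished boarding and alighting at a stop'."""
    if bus_id in action_dict:
        # The previous decision of this bus now has a realized outcome,
        # so the full transition can be committed.
        prev_state, prev_action = action_dict.pop(bus_id)
        replay_buffer.append((prev_state, prev_action, reward, state, done))
    if done:
        return None  # trip finished: nothing to latch
    action = policy(state)
    action_dict[bus_id] = (state, action)
    return action
```

Because the latch is keyed by bus, events from different vehicles can interleave freely, and a bus that is activated late simply creates a new key the first time it reports a state.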
4.2 Components of reinforcement learning
Rather than assigning an individual policy to each vehicle, the system employs a single reinforcement learning agent that observes the operational context of each bus at decision points and outputs control actions accordingly. Unlike conventional time-stepped RL settings, decisions here are triggered by discrete events, specifically upon completion of passenger boarding and alighting at a stop. Let denote the time when bus arrives at stop , and be the moment when all passenger activities are completed. At time , the agent receives a state observation , obtains a reward , and selects a continuous holding action . This action determines how many additional seconds the vehicle will hold at the stop after natural dwelling, with the goal of improving headway regularity. The subsequent state is only observed when bus completes boarding and alighting at stop , i.e., at . The entire process is illustrated in Fig. 1.
State Representation.
The state vector includes both categorical and numerical features. The categorical component consists of:
[TABLE]
where time_period indicates the current hour of operation derived from total elapsed time . These are encoded through learnable embedding layers to accommodate nonlinearity and improve generalization. The numerical features are:
[TABLE]
which directly reflect local traffic conditions and service level. This hybrid structure enables the policy network to incorporate heterogeneous information sources and handle data imbalance arising from irregular trip activation. This addresses a common challenge in real-world systems where specific vehicles operate solely during peak periods, resulting in skewed experience distributions across agents. By embedding categorical variables such as bus and trip identifiers, the single-agent framework generalizes across heterogeneous spatio-temporal contexts without requiring separate parameterizations for each vehicle instance.
Action Definition.
While inter-station-based speed control could theoretically provide finer granularity in maintaining regular headways, we deliberately adopt station-based holding control as our primary action space. This choice is driven by the following practical constraints observed in real-world bus operations:
- Safety and road conditions: Dynamic speed adjustment is constrained by urban traffic rules, varying road conditions, and safety requirements. Buses operating in mixed traffic environments are restricted from arbitrarily accelerating or decelerating.
- Vehicle inertia: Acceleration or deceleration behavior is affected by bus occupancy levels. Heavily loaded buses exhibit higher inertia, limiting their responsiveness to speed commands.
- Action uncertainty: In realistic scenarios, instructions to modify speed are subject to interpretation and execution by human drivers, introducing uncertainty in action realization. This makes direct speed control less reliable than station-based holding.
- Operational preference: Bus companies are more inclined to adopt stop-level holding strategies since these can be clearly communicated to drivers and executed unambiguously. Many transit agencies already implement static holding as part of daily operations.
Therefore, the action in our reinforcement learning formulation corresponds to holding time at the current station. This is not only operationally feasible and enforceable but also aligns with industry best practices for mitigating bus bunching while minimizing disruption to passenger experience.
We therefore define the action as a scalar value sampled from a bounded continuous space , where is a preset upper limit (detailed in Section 5). It corresponds to the additional dwell time (holding) a bus will incur beyond the necessary passenger exchange duration. This continuous formulation contrasts with discrete or macro-action settings in prior studies, allowing finer control granularity [6, 4].
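One common way to obtain a bounded continuous action from a Gaussian policy, as is typical in SAC implementations, is to squash the raw sample with tanh and rescale it. The sketch below assumes that convention; the paper does not state how its bound is enforced, and the function name and the example limit of 60 s are hypothetical.

```python
import math


def squash_to_holding_time(u, a_max):
    """Map an unbounded policy sample u to a holding time in [0, a_max] seconds."""
    # tanh maps u to (-1, 1); shifting and scaling gives the range [0, a_max].
    return 0.5 * (math.tanh(u) + 1.0) * a_max


hold = squash_to_holding_time(0.0, 60.0)  # a raw sample of 0 maps to the midpoint
```

A raw sample of zero maps to the midpoint of the range, and large negative or positive samples saturate towards no holding or the maximum holding time respectively.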
Reward Function Design.
The reward is designed to promote both symmetry in headways and schedule adherence. At each stop , once a bus completes its boarding process, it receives a scalar reward based on its and . The function consists of three components:
- Schedule alignment: A soft penalty proportional to the deviation from the nominal target headway of .
- Headway symmetry: A similarity bonus inversely proportional to the absolute difference between and , encouraging platoon balance.
- Robust penalization: Additional penalty is applied when either headway deviates from the target by more than .
Let
[TABLE]
where:
[TABLE]
Here rewards headways close to , while dynamically weighs the forward and backward components. The final penalty term accounts for extreme outliers.
As illustrated in Fig. 2, the reward surface exhibits a prominent ridge along the line , where vehicle spacing is both balanced and schedule-aligned. Conversely, deviations along the direction produce sharp declines in reward due to penalized asymmetry. This geometry ensures that the agent is incentivized not only to maintain consistent spacing, but also to adaptively adjust dwell times based on real-time headway context. Although the reward surface is clearly ridge-shaped, a static plot may not fully convey its three-dimensional geometry. For better intuition, an interactive and rotatable visualization of the reward function is available as the smoothed_reward_surface.html file in our public repository at https://github.com/erzhu419/Categorical-Feature-sac-in-bus-simulation, where all experimental codes and assets used in this study are also provided.
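Since the reward equations were lost in extraction, the following is only a hedged sketch of a reward with the three components described above, not the authors' formula. The functional form, the weights `w_sched`, `w_sym`, and `w_out`, the tolerance `tol`, and the target `H = 360` s are all assumptions, and the dynamic weighting of forward and backward components mentioned in the text is not modelled.

```python
def ridge_reward(h_f, h_b, H=360.0, w_sched=1.0, w_sym=1.0, tol=0.5, w_out=2.0):
    """Illustrative ridge-shaped reward over forward/backward headways (seconds)."""
    # Schedule alignment: soft penalty proportional to deviation from target H.
    sched = -w_sched * (abs(h_f - H) + abs(h_b - H)) / (2.0 * H)
    # Headway symmetry: bonus that decays with |h_f - h_b|, maximal on h_f == h_b.
    sym = w_sym / (1.0 + abs(h_f - h_b) / H)
    # Robust penalization: extra penalty once either headway is far off target.
    outlier = -w_out if max(abs(h_f - H), abs(h_b - H)) > tol * H else 0.0
    return sched + sym + outlier
```

In this sketch the maximum lies at `h_f == h_b == H`. Moving along the diagonal `h_f == h_b` costs only the schedule term, whereas an asymmetric pair with the same total schedule deviation also loses part of the symmetry bonus, which is what produces the ridge.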
Although many prior studies advocate inter-station cruise speed control [7, 13], we explicitly choose station-based holding in this work due to three pragmatic considerations rooted in real-world operations:
1. Limited action executability: Speed control inherently requires the driver or vehicle control unit to continuously adjust velocity profiles in response to subtle policy outputs. In realistic bus operations, this is often impractical: driver response is delayed, bounded by road conditions, and strongly affected by onboard load-induced inertia. Consequently, the actual control action (e.g., reducing speed or accelerating by 10%) may not be reliably executed, introducing action uncertainty into the control loop.
2. Safety and regulatory constraints: Rapid speed modulation to maintain headway (especially deceleration) often conflicts with traffic safety rules and leads to passenger discomfort, which transit agencies are keen to avoid. This makes speed control operationally and contractually infeasible in many public transit settings.
3. Holding as the de facto industry norm: In practice, holding buses at stations (especially terminal or major stops) is a widely accepted operational strategy, as it is easier to communicate, enforce, and integrate into static schedules. Agencies also prefer holding because it enables pre-emptive coordination (e.g., scheduled rest periods or crew shifts) without increasing perceived risk.
Therefore, we model bus control as discrete holding actions only, to better reflect feasible and enforceable policy deployment in real-world bidirectional corridors. While inter-station control remains a promising theoretical extension, its robustness under human-in-the-loop and partially controllable systems deserves separate investigation.
Transition and Experience Buffer.
Due to the asynchronous nature of events, transitions cannot be immediately constructed. Instead, each agent maintains a temporary buffer indexed by its identifier. When a bus completes its stop interaction at , it stores . Once the next state is available at , the full transition tuple is assembled and inserted into the global replay buffer. This design accommodates interleaved transitions from multiple agents and enables stable training in partially observable and event-driven environments.
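The deferred assembly described above can be sketched as follows. This is a minimal illustration rather than the repository's implementation: the class name `TransitionAssembler`, the method `on_stop_event`, the buffer capacity, and the convention that the reward passed at a stop scores the bus's previous action are all assumptions made for the example.

```python
from collections import deque


class TransitionAssembler:
    """Pairs each bus's previous (state, action) with its next observed state."""

    def __init__(self, capacity=100_000):  # capacity is an illustrative assumption
        self.pending = {}                     # bus_id -> (state, action) awaiting next state
        self.replay = deque(maxlen=capacity)  # single global buffer shared by all buses

    def on_stop_event(self, bus_id, state, reward, action, done=False):
        """Call when `bus_id` completes a stop; `reward` scores its previous action."""
        if bus_id in self.pending:
            prev_state, prev_action = self.pending.pop(bus_id)
            self.replay.append((prev_state, prev_action, reward, state, done))
        if not done:
            self.pending[bus_id] = (state, action)


buf = TransitionAssembler()
buf.on_stop_event(bus_id=1, state=[0.0], reward=0.0, action=10.0)
buf.on_stop_event(bus_id=2, state=[5.0], reward=0.0, action=0.0)   # interleaved bus
buf.on_stop_event(bus_id=1, state=[1.0], reward=-2.5, action=5.0)  # completes bus 1's tuple
```

After the third call, the replay buffer holds one complete transition for bus 1, while bus 2 still waits in the pending table for its next stop.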
4.3 Feature representation and embedding network
To enable a unified policy to generalize across different vehicles, stops, and temporal contexts, we explicitly incorporate four categorical variables into the state space:
[TABLE]
These features do not carry intrinsic numerical meaning but are critical for distinguishing instances with context-specific dynamics. Rather than employing one-hot encoding, which would lead to high-dimensional sparse inputs, we adopt an embedding-based approach.
Each categorical variable is mapped to a dense, learnable vector through an embedding layer. Let denote the value of the -th categorical feature, and let be the corresponding embedding matrix, where is the number of unique categories and is the embedding dimension (typically ). Then the embedded representation is:
[TABLE]
All categorical embeddings are then concatenated:
[TABLE]
This dense vector is further concatenated with the numerical part of the state vector:
[TABLE]
where is the current route segment speed. We also investigated including the speeds of all route segments as input features, but observed negligible improvement in performance.
The final input vector is fed into both the Q-networks and policy network in the SAC framework. Fig. 3 illustrates this process.
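To make the lookup-and-concatenate step concrete, the following framework-free sketch builds the input vector from categorical IDs and numerical features. The category counts, embedding dimensions, and the three numerical features shown are illustrative assumptions, not the values used in our experiments; in practice the tables would be trainable embedding layers rather than fixed random lists.

```python
import random


def make_embedding(num_categories, dim, rng):
    """One row per category; small random values stand in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_categories)]


def build_input(cat_values, tables, numeric):
    """Look up each categorical ID, concatenate the rows, then append the numerics."""
    vec = []
    for name, idx in cat_values.items():
        vec.extend(tables[name][idx])
    return vec + list(numeric)


rng = random.Random(0)
# (number of categories, embedding dimension) per feature: assumed for illustration.
sizes = {"bus_id": (20, 4), "station_id": (22, 4), "time_period": (13, 3), "direction": (2, 1)}
tables = {name: make_embedding(n, d, rng) for name, (n, d) in sizes.items()}

x = build_input(
    {"bus_id": 3, "station_id": 7, "time_period": 2, "direction": 1},
    tables,
    numeric=[360.0, 345.0, 8.5],  # hypothetical headways (s) and segment speed (m/s)
)
```

The resulting vector has 4 + 4 + 3 + 1 embedding entries followed by the three numerical entries.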
This architecture not only reduces the number of required agents, thereby indirectly reducing the dimensionality of the learning problem, but also enables the network to capture semantic similarity between different context entities. For instance, two buses operating during the same peak hour or at adjacent stops can share latent representations via shared embedding weights, improving generalization and training stability. The use of entity embeddings is particularly important in our single-agent setting, where data distribution is inherently imbalanced across different vehicle instances.
4.4 Soft actor-critic framework
SAC is an off-policy deep reinforcement learning algorithm that integrates the maximum entropy principle into the actor-critic paradigm [9, 10]. Unlike traditional actor-critic algorithms that focus purely on maximizing cumulative reward, SAC augments the objective with an entropy regularization term. This encourages the learned policy to remain stochastic during training, which enhances exploration and naturally promotes more robust behavior in environments characterized by noise or partial observability.
Formally, SAC optimizes the following objective:
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$
where $\mathcal{H}(\pi(\cdot \mid s_t))$ is the policy entropy, and $\alpha$ is a temperature parameter balancing reward and entropy. In high-variance stochastic environments, this structure is particularly useful for ensuring the policy does not collapse to greedy behavior too early during training, which is crucial given the asynchronous, unevenly distributed bus experience data.
To evaluate and update the policy, SAC relies on a soft version of the Bellman backup operator, which iteratively estimates the soft Q-function under the current policy:
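$$\mathcal{T}^{\pi} Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[ V(s_{t+1}) \right]$$
$$V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right]$$
Here, $Q(s_t, a_t)$ is the soft state-action value function, and $V(s_t)$ is the soft state value function implicitly defined by entropy adjustment. The role of the Bellman backup is to converge toward the true Q-values under policy $\pi$, forming the core target signal for critic learning.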
The critic networks are trained by minimizing the following squared soft Bellman residual:
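$$J_Q(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\left[ \tfrac{1}{2}\big( Q_{\theta_i}(s_t, a_t) - y_t \big)^2 \right], \qquad y_t = r_t + \gamma \Big( \min_{j=1,2} Q_{\bar{\theta}_j}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \Big), \quad a_{t+1} \sim \pi_\phi(\cdot \mid s_{t+1})$$
In our implementation, two Q-networks $Q_{\theta_1}$ and $Q_{\theta_2}$ are trained simultaneously to reduce overestimation bias, and the target value $y_t$ is computed using the smaller of the two Q-values from slowly updated target networks $Q_{\bar{\theta}_j}$. This stabilizes critic convergence under highly variable headway dynamics and OD noise.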
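As a scalar illustration of the clipped double-Q target and the slow target-network update, consider the sketch below. The function names and the discount factor `gamma=0.99` are assumptions for the example; `tau=0.005` matches the target smoothing coefficient reported in Section 5.1.2.

```python
def soft_td_target(reward, q1_next, q2_next, logp_next, alpha, gamma=0.99, done=False):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')); y = r at terminal states."""
    if done:
        return reward
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    return reward + gamma * soft_value


def polyak_update(target_params, online_params, tau=0.005):
    """theta_bar <- tau * theta + (1 - tau) * theta_bar, element-wise."""
    return [tau * w + (1.0 - tau) * w_bar for w, w_bar in zip(online_params, target_params)]


y = soft_td_target(reward=-1.0, q1_next=-10.0, q2_next=-12.0, logp_next=-0.5, alpha=0.2)
new_target = polyak_update(target_params=[0.0, 1.0], online_params=[1.0, 1.0])
```

Taking the minimum of the two next-state estimates makes the target pessimistic, which is what counteracts overestimation; the entropy bonus then raises it slightly when the next action is unlikely under the policy.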
The policy is modeled as a squashed Gaussian distribution whose samples are passed through a nonlinearity to enforce bounded actions. This is critical in our domain, where the action represents additional holding time and must lie within a predefined range.
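A minimal sketch of squashed-Gaussian sampling for a bounded holding time is given below. The holding range of 0 to 60 seconds, the function name, and the small constant added inside the logarithm are assumptions for illustration only.

```python
import math
import random


def sample_holding_time(mean, log_std, low, high, rng):
    """Sample u ~ N(mean, std), squash with tanh, and rescale to [low, high] seconds."""
    std = math.exp(log_std)
    u = rng.gauss(mean, std)
    a = math.tanh(u)                              # squashed sample in [-1, 1]
    hold = low + 0.5 * (a + 1.0) * (high - low)   # affine map to the holding range
    # Log-density of u under the Gaussian.
    logp_u = -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change of variables for a = tanh(u); 1e-6 guards against log(0).
    logp_a = logp_u - math.log(1.0 - a * a + 1e-6)
    return hold, logp_a


rng = random.Random(0)
hold, logp = sample_holding_time(mean=0.0, log_std=-0.5, low=0.0, high=60.0, rng=rng)
```

The returned log-probability refers to the tanh-squashed variable; the affine rescaling to seconds only shifts it by a constant, which does not change the policy gradient.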
Policy learning is performed by minimizing the expected KL divergence between the policy and a softmax over the Q-function, originally expressed as:
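$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[ D_{\mathrm{KL}}\left( \pi_\phi(\cdot \mid s_t) \,\Big\|\, \frac{\exp\big( \tfrac{1}{\alpha} Q_\theta(s_t, \cdot) \big)}{Z_\theta(s_t)} \right) \right]$$
Here, the normalization constant $Z_\theta(s_t)$ ensures that the exponentiated Q-function defines a proper probability density over the action space. Notably, $Z_\theta(s_t)$ is independent of the current policy $\pi_\phi$, and thus does not affect gradient-based updates.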
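Expanding the KL divergence and dropping the normalization constant, which does not depend on the policy parameters $\phi$, yields the policy loss:
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\phi}\left[ \alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t) \right]$$
which is the standard SAC actor loss minimized during training. To avoid manual tuning of the temperature parameter $\alpha$, SAC treats it as a learnable dual variable and updates it via dual gradient descent to enforce a target entropy. Specifically, $\alpha$ is optimized by minimizing:
$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_\phi}\left[ -\alpha \big( \log \pi_\phi(a_t \mid s_t) + \bar{\mathcal{H}} \big) \right]$$
where $\bar{\mathcal{H}}$ denotes the desired entropy level. This mechanism ensures that the policy maintains high stochasticity in the early stages of training and gradually becomes more deterministic as convergence is approached. The inner term of the policy loss encourages the policy to choose actions with high Q-values while maintaining sufficient entropy.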
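The direction of the temperature update can be checked with a small scalar sketch. Parameterizing the temperature through its logarithm, the learning rate, and the function name are assumptions of this example rather than details taken from our implementation.

```python
import math


def temperature_step(log_alpha, logp_batch, target_entropy, lr=1e-3):
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi + H_target)]."""
    alpha = math.exp(log_alpha)
    mean_logp = sum(logp_batch) / len(logp_batch)
    # dJ/d(log_alpha) = -alpha * (E[log pi] + H_target)
    grad = -alpha * (mean_logp + target_entropy)
    return log_alpha - lr * grad


# Policy entropy (0.2) is above the target (-1.0): the temperature should shrink.
lowered = temperature_step(0.0, [-0.2, -0.2], target_entropy=-1.0)
# Policy entropy (-3.0) is below the target (-1.0): the temperature should grow.
raised = temperature_step(0.0, [3.0, 3.0], target_entropy=-1.0)
```

When the policy is more random than the target requires, the entropy bonus is reduced; when it is too deterministic, the bonus is increased.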
SAC is well-suited to our asynchronous, event-driven dispatching environment. It allows for off-policy updates using irregular, vehicle-dependent transitions and gracefully handles nonstationary feedback signals. In contrast to MARL methods that require dense and balanced interactions across agents, our SAC implementation can learn a single policy that generalizes over time, space, and bus identity.
4.5 Equivalence of SAC and robust reinforcement learning
The primary challenge in our bus holding control problem stems from dynamic uncertainty: travel times between stops exhibit stochastic variability throughout the day due to congestion and irregular passenger demand. This manifests primarily as uncertainty in the transition dynamics rather than the reward function itself. From a reinforcement learning perspective, this is fundamentally a dynamically robust decision problem where the policy must perform well even under small, adversarial perturbations to the dynamics model.
Directly addressing dynamic robust RL problems is often intractable, especially in continuous control settings. However, Theorem 4.2 in [8] offers critical insight: optimizing a maximum entropy RL objective with a modified reward function under the unperturbed dynamics establishes a lower bound on the robust objective under the true (perturbed) dynamics . Specifically, instead of optimizing:
[TABLE]
we optimize:
[TABLE]
which serves as a tight lower bound on the robust formulation.
As the agent accumulates sufficient trajectories such that the sample mean of stabilizes to a constant across time, the Jensen gaps introduced in the theoretical derivation (see Eq. 9 in [8]) become small, and the lower bound becomes increasingly tight. This behavior is analogous to Bayesian optimization, where high posterior variance early in training warrants broad exploration, but posterior uncertainty shrinks as more samples are acquired. Similarly, the stochastic policy induced by SAC begins with high entropy (large ) to hedge against uncertainty in reward and dynamics, and gradually converges to a more deterministic policy as uncertainty reduces.
Admittedly, this analogy is imperfect: Bayesian posterior variance captures epistemic uncertainty, while real-world bus dynamics involve irreducible aleatoric uncertainty (e.g., signal timing fluctuations or passenger boarding variability). Nonetheless, the mechanism aligns well: early-stage high entropy enables SAC to explore and tolerate dynamics variability, while later-stage entropy decay naturally aligns the policy with the most likely transition behavior.
By leveraging SAC, we implicitly optimize a provably grounded surrogate for the dynamic robust control problem we care about without explicitly modeling worst-case perturbations. This makes SAC particularly well-suited to our setting, and empirically, it demonstrates greater stability than standard MARL baselines under the same simulation conditions.
5 Experimental setup and results
5.1 Experimental setup and hyperparameters
To evaluate the performance and robustness of our proposed holding control method under realistic transit conditions, we develop a custom discrete-event simulation environment that models bidirectional bus operations with non-stationary (time-varying) demand and traffic uncertainty. The experimental setup is structured as follows. (Source code: https://github.com/erzhu419/Categorical-Feature-sac-in-bus-simulation.git)
5.1.1 Simulation environment and data sources
The simulator is implemented in Python 3.8 and is designed to simulate dynamic bus fleet operations along a corridor with two terminal stations. Each simulation episode begins at 6:00 AM and runs until all scheduled trips in the daily timetable are completed.
Passenger demand and traffic conditions are synthetically generated following the modeling framework proposed in Zheng et al. [27]. Specifically:
- •
Passenger demand is specified by an hourly OD matrix stored in passenger_OD.xlsx. For each pair of stations and each time period, the matrix specifies the expected number of passengers boarding per hour. During simulation, passenger arrivals at each stop are sampled from a Poisson distribution with a rate derived from the OD entry divided by 3600, generating fine-grained second-by-second demand.
- •
Traffic speed is provided in route_news.xlsx, which contains the average speed (in m/s) for each inter-stop segment, varying by hour. Actual travel speed during simulation is sampled from a Gaussian distribution with the recorded mean and a fixed standard deviation of , reflecting random perturbations in traffic conditions. This setup closely mirrors the uncertainty modeling approach in [27].
- •
Route structure is described in stop_news.xlsx, which defines 22 stops including two terminals and 20 intermediate stops. Each vehicle operates in a single direction per trip, either from terminal_down to terminal_up (“up” direction) or the reverse (“down” direction).
- •
Timetable-driven dispatching is defined in time_table.xlsx. The operation starts at 6:00 AM. Vehicles are dispatched every 360 seconds (6 minutes) from both terminals. An offset of 180 seconds is applied to the downstream direction to stagger departures (e.g., Upstream starts at 6:00:00, Downstream at 6:03:00). Once a vehicle completes a trip and returns to its terminal, it may either rest or be reassigned to a subsequent trip contingent on schedule availability.
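The stochastic inputs and the staggered timetable described in the list above can be sketched with the standard library alone. The hourly rate, mean speed, standard deviation, and speed floor used below are placeholder assumptions; only the 3600-second rate conversion, the 360-second headway, the 180-second offset, and the 06:00 start come from the text.

```python
import math
import random


def poisson(lam, rng):
    """Knuth's method; adequate for the small per-second rates used here."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def arrivals_per_second(hourly_rate, seconds, rng):
    """Second-by-second arrivals for one OD pair: Poisson(hourly_rate / 3600)."""
    lam = hourly_rate / 3600.0
    return [poisson(lam, rng) for _ in range(seconds)]


def sample_speed(mean_speed, sigma, rng, floor=0.1):
    """Gaussian segment speed in m/s; the floor (an assumption) keeps travel time finite."""
    return max(floor, rng.gauss(mean_speed, sigma))


def departures(start_s, headway_s, offset_s, n_trips):
    """Departure times (seconds after midnight) for the up and down directions."""
    up = [start_s + i * headway_s for i in range(n_trips)]
    down = [t + offset_s for t in up]
    return up, down


rng = random.Random(42)
arr = arrivals_per_second(hourly_rate=120.0, seconds=3600, rng=rng)  # placeholder rate
v = sample_speed(mean_speed=8.0, sigma=1.5, rng=rng)                 # placeholder values
up, down = departures(start_s=6 * 3600, headway_s=360, offset_s=180, n_trips=3)
```

Here `up` starts at 06:00:00 and `down` at 06:03:00, reproducing the staggered dispatch pattern.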
This bidirectional, schedule-triggered structure extends the unidirectional simulation approach in [27], and enables the modeling of peak vs. off-peak asymmetry, heterogeneous headway patterns, and dynamic fleet scaling.
Specific Case Study Parameters. In the experiments presented in this section, we instantiate the general model with the following reliable real-world settings:
- •
Service window: 13 hours (06:00 to 19:00);
- •
Scheduled headway: 360 seconds;
- •
Dispatch offset: 180 seconds;
- •
Number of stops: 22.
5.1.2 Network architectures and hyperparameters
All models are implemented using PyTorch. The architectural and training configurations are:
Embedding-SAC, DDPG, and TD3 (single-agent RL)
- •
Embedding layers: for each categorical variable (bus ID, station ID, time period, direction), a learnable embedding is constructed with dimension set as , where is the number of unique values for category ;
- •
Actor and critic networks: both are 4-layer multilayer perceptrons (MLPs) with hidden sizes of [32, 32, 32], followed by task-specific output layers (mean/log-std for policy; scalar value for Q-function);
- •
Activation function: ReLU;
- •
Optimizer: Adam;
- •
Learning rate: ;
- •
Batch size: 2048;
- •
Target smoothing coefficient: 0.005.
MADDPG (PS/NPS)
- •
Actor and critic networks: 3-layer MLPs with layer sizes (where output_dim is determined by the action or Q-value dimension);
- •
Activation function: ReLU;
- •
Optimizer: Adam;
- •
Learning rate: for both actor and critic;
- •
Batch size: 128;
- •
Target smoothing coefficient: 0.01 (Polyak averaging);
- •
Parameter sharing: In PS, all agents share a single actor and critic network, with agent IDs encoded as part of the input; in NPS, each agent trains its own independent actor and critic networks.
- •
Exploration noise: Gaussian noise with standard deviation 0.2, clipped to the action bounds;
- •
Replay buffer size: transitions per agent.
Batch Size Selection.
The batch size settings for MADDPG and single-agent RL differ significantly because of how experience is stored. In MARL algorithms, the agents often share the replay data they collected from their individual experiences. Conversely, in single-agent RL, there is only one agent and one buffer: every experience is stored in this single buffer, and the batch size is scaled according to the total number of experiences available. The batch sizes are therefore chosen to keep the number of transition tuples available to each agent roughly the same in both settings, which prevents any one agent from being overwhelmed by the experiences of others.
Specifically, single-agent RL employs a shared replay buffer across all buses, which allows it to sample from a larger and more diverse dataset. This design enables single-agent RL to leverage experiences from other buses in the same fleet, improving generalization and stability. However, the most relevant data for single-agent RL still comes from experiences associated with the same bus ID and station ID as the current decision point. Given that the fleet size is approximately 20 (varying slightly with traffic conditions), dividing the single-agent RL batch size by the fleet size provides an effective state-action batch size comparable to MADDPG’s setting. For example, with a single-agent RL batch size of 2048, the effective per-bus batch size is approximately $2048 / 20 \approx 102$, comparable to MADDPG’s batch size of 128.
This approach balances the benefits of global experience sharing with the need for local relevance, ensuring that single-agent RL maintains both robustness and specificity in its policy updates.
5.2 Performance comparison
Figure 4 presents the training reward trends of the single-agent RL method versus two variants of MADDPG: one with parameter sharing (MADDPG-PS) and one without (MADDPG-NPS). To better reveal convergence dynamics and stability, we display both 10-episode rolling means (solid lines) and exponentially weighted moving averages (EWM, dashed lines), along with shaded areas indicating the corresponding standard deviations. A horizontal dashed gray line is included to represent the average reward of the uncontrolled case, serving as a performance baseline for comparison.
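For reference, the two smoothing schemes used in the figure can be written in a few lines. The sketch below uses the unadjusted recursive form of the exponentially weighted mean; the smoothing factor and the toy reward values are assumptions, and plotting libraries may apply a bias-adjusted variant by default.

```python
def rolling_mean(values, window=10):
    """Mean over the trailing `window` episodes (shorter windows at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def ewm_mean(values, alpha=0.2):
    """Recursive exponentially weighted mean: m_t = alpha * x_t + (1 - alpha) * m_{t-1}."""
    out, m = [], None
    for x in values:
        m = x if m is None else alpha * x + (1.0 - alpha) * m
        out.append(m)
    return out


rewards = [-700.0, -600.0, -500.0, -400.0]  # toy episode returns
smooth = rolling_mean(rewards, window=2)
ewm = ewm_mean(rewards, alpha=0.5)
```

The rolling mean reacts with a fixed lag, whereas the exponentially weighted mean retains a decaying memory of all earlier episodes.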
Combining the training dynamics from Figure 4 with the cross-variance robustness results in Table 1, several critical observations emerge regarding algorithm performance under stochastic transit conditions:
(1) SAC’s consistent superiority across all noise levels. SAC outperforms both TD3 and DDPG by a substantial margin under all testing conditions. At , SAC achieves rewards in the range of K to K, while TD3 and DDPG stagnate at approximately K to K, performance that is effectively halved. This performance gap widens dramatically as environmental stochasticity increases: at , SAC maintains rewards around K to K, whereas TD3 and DDPG completely fail to converge, yielding catastrophic rewards near K to K. This demonstrates SAC’s fundamental advantage in handling the high-dimensional, variance-sensitive state space inherent to bus fleet control.
(2) Catastrophic failure of TD3 and DDPG under high variance. Table 1 reveals a striking pattern: as increases from 1.0 to 2.0, TD3 and DDPG exhibit near-total collapse. Their mean rewards degrade by 3–4 times compared to low-noise conditions, with DDPG performance dropping from approximately K at to below K at . TD3 displays particularly erratic behavior, with anomalous spikes (e.g., K under certain configurations for ) suggesting training instability and poor generalization. In contrast, SAC’s performance degrades gracefully from approximately K to K across the same variance range, indicating robust policy learning that is not overfitted to specific noise realizations.
(3) Full embedding architecture achieves best results under high stochasticity. Examining SAC variants across embedding configurations, the Full embedding (incorporating vehicle ID, station ID, time period, and direction) consistently delivers the best performance at higher noise levels: K at , K at , and K at . This pattern confirms that rich categorical representations are essential for capturing the heterogeneity and temporal dynamics of bus operations. Simpler representations (One-hot and None) perform adequately under low noise but lack the representational capacity to maintain robustness as variance increases, highlighting the importance of learned embeddings in stochastic environments.
(4) Rapid convergence and superior sample efficiency. Figure 4 illustrates that SAC converges to stable high rewards within approximately 30 episodes, while MADDPG-PS and MADDPG-NPS require over 150 and 180 episodes, respectively. SAC’s plateau near K contrasts sharply with MADDPG-PS at K and MADDPG-NPS at K, all far superior to the uncontrolled baseline near K. Moreover, the shaded variance bands around SAC remain narrow throughout training, confirming both fast convergence and low variability, which are critical for real-world deployment where sample efficiency and reliability are paramount.
Summary: The combined evidence from training curves and cross-variance evaluation demonstrates that SAC with full categorical embeddings is uniquely suited for robust bus fleet control. While TD3 and DDPG completely fail under realistic noise levels (), SAC maintains both high performance and stability, converging 5–6 times faster than multi-agent alternatives. This superiority stems from SAC’s entropy-regularized objective and its ability to leverage rich learned representations of vehicle identity, spatiotemporal context, and directional heterogeneity, capabilities essential for real-world transit systems operating under unpredictable demand and traffic conditions.
To further assess robustness and the contribution of categorical embeddings, we conduct a full ablation study across noise levels and three input representations: (i) No categorical features (IDs removed), (ii) One-hot encoding, and (iii) Learnable embeddings. Each model (SAC, TD3, DDPG) is trained under every and evaluated under each , producing a total of combinations. Table 1 summarizes the best results for each testing variance (complete results are provided in the supplementary CSV cross_sigma_all_results.csv; the subset of best scores is stored in cross_sigma_best_results.csv).
Several critical insights emerge from Table 1:
(1) SAC dominates across all noise levels. Under every condition, SAC variants consistently outperform TD3 and DDPG by factors of 2–4. Even SAC’s worst configuration (None embedding at : K) substantially exceeds the best TD3 or DDPG results at the same noise level (approximately K to K).
(2) TD3 and DDPG fail catastrophically under high variance. At and , both algorithms exhibit near-complete collapse, with rewards degrading to K to K, worse than 3 times their low-noise performance. This suggests fundamental brittleness in deterministic policy gradient methods when applied to high-dimensional, stochastic transit control.
(3) Full embedding excels under stochasticity. SAC with Full embedding achieves the best reward in every setting (bolded in table): K, K, and K for respectively. In contrast, simpler representations (One-hot, None) show diminished robustness as noise increases, confirming that learned categorical embeddings are essential for generalizing across diverse operational conditions.
(4) Cross-variance generalization is algorithm-dependent. SAC’s performance degrades smoothly from K to K as doubles, while TD3 and DDPG exhibit erratic, non-monotonic behavior (e.g., TD3 Full’s anomalous K and K spikes), indicating poor policy stability and overfitting to specific noise realizations.
Overall, these results demonstrate that SAC with rich categorical embeddings uniquely combines high performance, robustness to variance shifts, and stable cross-domain generalization, properties critical for real-world transit systems where demand and travel times vary unpredictably.
5.3 Empirical evidence of bus bunching and control effectiveness
To evaluate the effectiveness of our proposed SAC-based holding strategy and to visually demonstrate how bus bunching occurs and propagates, we present a series of trajectory-based and statistical analyses.
Figure 5 visualizes the complete operational period (06:00–19:00) for both the controlled and uncontrolled scenarios. Each colored line depicts the trajectory of an individual bus as it travels between terminals, traversing all 22 stops. In the uncontrolled scenario (b, bottom panel), frequent bunching events are observed, highlighted by red segments indicating intervals where two or more buses operate in close proximity. These bunching occurrences are most prevalent during the early morning (06:30–08:30) and late afternoon (16:00–18:00) periods, immediately preceding the peak demand hours.
In contrast, the controlled scenario (a, top panel), where SAC regulates holding decisions at stops, demonstrates a complete absence of bunching events throughout the episode. Bus trajectories remain uniformly spaced, confirming the robustness and effectiveness of the proposed control strategy under stochastic demand and travel speed conditions.
Figure 6 provides a fine-grained analysis of the spatio-temporal distribution of bunching events observed in the uncontrolled case. Subplots (a) and (b) identify the top 7 stops most frequently involved in bunching, stratified by travel direction. These are predominantly downstream stops near the end of each route, where accumulated delays due to traffic precipitate bus convergence. Subplots (c) and (d) depict hourly bunching frequencies in both directions. The risk of bunching is most pronounced during transitional hours, especially 07:00–09:00 and 16:00–18:00, which precede the peak intensity of rush hour demand. This corroborates our earlier hypothesis that bunching is not exclusively a peak-hour phenomenon, but rather the result of demand-speed imbalances accumulating in the periods preceding peak load.
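One way to tabulate hourly bunching frequencies from logged arrival times is sketched below. The bunching criterion used here (a gap to the preceding bus shorter than a quarter of the scheduled headway) is an assumption for illustration and is not necessarily the threshold used to produce Figure 6; the function name and sample arrivals are likewise hypothetical.

```python
from collections import Counter


def bunching_by_hour(arrivals_s, scheduled_headway_s=360, ratio=0.25):
    """Count arrivals at one stop whose gap to the previous bus is below ratio * headway."""
    counts = Counter()
    times = sorted(arrivals_s)
    for prev, cur in zip(times, times[1:]):
        if cur - prev < ratio * scheduled_headway_s:
            counts[cur // 3600] += 1  # key by hour of day of the trailing bus
    return counts


# Arrival times in seconds after midnight at a single stop (toy data).
arrivals = [7 * 3600, 7 * 3600 + 60, 7 * 3600 + 420, 8 * 3600 + 30, 8 * 3600 + 50]
hourly = bunching_by_hour(arrivals)
```

With a 360-second headway the threshold is 90 seconds, so the 60-second gap at 07:01 and the 20-second gap at 08:00 are each counted once.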
Together, these findings reinforce the key design motivations of our work:
- •
Bunching tends to emerge in pre-peak and transitional periods due to localized demand or speed perturbations;
- •
Spatially, bunching intensifies near the end of each trip when the bus has been exposed to long sequences of stochastic disturbances;
- •
A control strategy that integrates spatio-temporal context, facilitated by SAC with categorical embedding, can effectively eliminate bunching under significant stochasticity.
These results validate the proposed modeling and control approach, demonstrating that a robustly trained single-agent RL policy can generalize across dynamic fleet states and operational conditions.
6 Conclusion
This paper presents a novel single-agent reinforcement learning framework for mitigating bus bunching in realistic, bidirectional, timetabled transit systems. Departing from conventional multi-agent approaches designed for idealized loop-line environments, our method explicitly incorporates categorical embeddings, such as vehicle ID, station ID, and trip identifiers, into the SAC policy. This enables a single agent to generalize across heterogeneous agents and spatio-temporal contexts, overcoming the limitations of data imbalance and agent-specific training instability inherent in traditional MARL settings.
To emulate operational realism, we construct a bidirectional bus simulation environment based on empirically derived passenger demand and route speed profiles. The environment features a timetable-driven dispatch mechanism and stochastic variability in both road conditions and boarding/alighting dynamics. This setup, extending and refining the methodology proposed by Zheng et al. [27], supports more faithful modeling of real-world operational challenges.
Through extensive experiments, we demonstrate that Embedding-SAC achieves better performance in terms of cumulative mean reward value, asymptotic stability, and variance of cumulative reward value compared to both parameter-sharing and non-sharing variants of MADDPG. Notably, our control policy eliminates all bunching events across the entire simulation period, as confirmed by trajectory-level visualizations and spatio-temporal heatmaps. These results validate that a well-structured single-agent policy, when leveraging appropriate embedded features, can effectively optimize holding decisions under uncertainty.
In future work, we plan to extend this framework to incorporate non-stationary demand distributions, transfer learning across corridors, and integration with high-level scheduling modules. We also intend to explore hierarchical and latent-graph models to further disentangle causal mechanisms in bus dynamics.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 72371251), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No. 2024JJ2080), and the Key Research and Development Program of Hunan Province of China (Grant No. 2024JK2007).
References
- [1] Bie Yiming, Xiong Xinyu, Yan Yadan, and Qu Xiaobo (2020) Dynamic headway control for high-frequency bus lines based on speed guidance and intersection signal adjustment. Computer-Aided Civil and Infrastructure Engineering 35(1), pp. 4–25.
- [2] Bus bunching - Wikipedia. https://en.wikipedia.org/wiki/Bus_bunching. Accessed: 2025-07-29.
- [3] Cats Oded and Glück Stefan (2019) Frequency and vehicle capacity determination using a dynamic transit assignment model. Transportation Research Record: Journal of the Transportation Research Board 2673(3), pp. 574–585.
- [4] Cats Oded, Nabavi Larijani Anahid, Olafsdottir Asdis, Burghout Wilco, Andreasson Ingmar, and Koutsopoulos Haris N. (2012) Bus-holding control strategies: simulation-based evaluation and guidelines for implementation. Transportation Research Record 2274(1), pp. 100–108.
- [5] Ceder Avishai (2016) Public transit planning and operation: modeling, practice and behavior. Second edition, CRC Press.
- [6] Cortés Cristián E., Sáez Doris, Milla Freddy, Núñez Alfredo, and Riquelme Marcela (2010) Hybrid predictive control for real-time optimization of public transport systems’ operations based on evolutionary multi-objective optimization. Transportation Research Part C: Emerging Technologies 18(5), pp. 757–769.
- [7] Daganzo Carlos F. and Pilachowski Josh (2011) Reducing bunching with bus-to-bus cooperation. Transportation Research Part B: Methodological 45(2), pp. 267–277.
- [8] Eysenbach Benjamin and Levine Sergey (2022) Maximum entropy RL (provably) solves some robust RL problems. In International Conference on Learning Representations (ICLR).
