Efficient Autonomy Validation in Simulation with Adaptive Stress Testing
Mark Koren, Mykel Kochenderfer

TL;DR
This paper introduces an improved adaptive stress testing method using recurrent neural networks to efficiently and robustly identify failure scenarios in autonomous systems, especially in complex, continuous initial condition spaces.
Contribution
The authors develop a recurrent neural network-based solver for adaptive stress testing, enabling generalization across continuous initial conditions and reducing computational complexity.
Findings
Successfully identified failure scenarios in autonomous driving simulations
Reduced computational complexity compared to previous methods
Demonstrated effectiveness in complex, realistic scenarios
Table I: Supports of the initial conditions

| Variable | Min | Max |
|---|---|---|

Table II: Aggregate results of the solvers

| Metric | MCTS | MLPDRL | DRDRL | GRDRL Point | GRDRL Bin |
|---|---|---|---|---|---|
| Average Collision Reward | | | | | |
| Max Collision Reward | | | | | |
| Collisions Found | 21 | 29 | 30 | 25 | |
| Collision Percentage | | | | | |
Mark Koren and Mykel J. Kochenderfer
Aeronautics and Astronautics, Stanford University, Stanford, CA 94305, USA, {mkoren, mykel}@stanford.edu
Abstract
During the development of autonomous systems such as driverless cars, it is important to characterize the scenarios that are most likely to result in failure. Adaptive Stress Testing (AST) provides a way to search for the most-likely failure scenario as a Markov decision process (MDP). Our previous work used a deep reinforcement learning (DRL) solver to identify likely failure scenarios. However, the solver’s use of a feed-forward neural network with a discretized space of possible initial conditions poses two major problems. First, the system is not treated as a black box, in that it requires analyzing the internal state of the system, which leads to considerable implementation complexities. Second, in order to simulate realistic settings, a new instance of the solver needs to be run for each initial condition. Running a new solver for each initial condition not only significantly increases the computational complexity, but also disregards the underlying relationship between similar initial conditions. We provide a solution to both problems by employing a recurrent neural network that takes a set of initial conditions from a continuous space as input. This approach enables robust and efficient detection of failures because the solution generalizes across the entire space of initial conditions. By simulating an instance where an autonomous car drives while a pedestrian is crossing a road, we demonstrate the solver is now capable of finding solutions for problems that would have previously been intractable.
I INTRODUCTION
Simulation can offer an inexpensive complement to field-testing for evaluating the safety of autonomous vehicles [1, 2, 3]. Such simulations can run faster than real-time and can more easily probe safety critical scenarios that cannot be obtained in real-world environments due to the rarity of events, cost incurred with failures, and ethical considerations. However, the space of edge-cases that can cause the autonomous vehicle to fail is vast [4].
Consider a pedestrian crossing a neighborhood road at a crosswalk, a problem we use as a running example and which is shown in Figure 1. A naive approach could be to assume the pedestrian follows a straight-line trajectory across the road. The scenario could then be simulated thousands of times with different pedestrian velocities. While this approach may be tractable, the computational savings come at the expense of safety. In reality, pedestrians do not only follow a straight line. Perhaps certain pedestrian paths take them into a sensor blind-spot, or elicit unsafe behavior from the test vehicle. Even unlikely collisions are significant due to the large number of miles an autonomous fleet will drive. It is not sufficient to assume other actors will always follow sensible trajectories. Unfortunately, modeling the behavior of other actors leads to an exponential explosion in possible scenarios and failures. Identifying critical scenarios through a brute-force search would be intractable due to the dimensionality of the search-space. Instead, researchers are focusing on adaptive methods for adversarially generating critical test scenarios in simulation [5, 6, 7, 8].
Adaptive Stress Testing (AST) provides a framework for finding the most-likely failure scenario of a system in simulation [9]. Knowing the most-likely failure is useful for both the development and the safety validation of an autonomous vehicle. In AST, the problem of finding the most-likely failure of a system is formulated as a Markov decision process (MDP). Failures can be found using reinforcement learning techniques. The process of solving an AST problem is shown in Figure 2. At each time-step, the solver provides environment actions that control the simulator. The simulator reports when a failure occurs and outputs the likelihood of the environment actions. Reinforcement learning techniques can be used to solve the MDP, with the reward function depending on the likelihood of actions taken and whether a failure was found. We recently introduced a deep reinforcement learning (DRL) solver that was able to find failures in an example autonomous vehicle scenario more efficiently than an existing Monte Carlo tree search (MCTS) solver [10]. However, there are two major limitations that make using the solver more challenging: the solver’s dependence on the simulation state and requirement to be run from a single initial condition.
The primary limitation of the DRL solver is its dependence on observing the simulation state, the collection of internal state variables that fully define the simulation. Many high-fidelity simulators are large, complicated software suites. Consequently, exposing the simulation state may be non-trivial; therefore it is useful to treat the simulator as a black box. In the current formulation of AST, the method is able to treat the simulation as a black box by using the history of actions taken as a substitute for the current state [9]. The system under test (SUT) must therefore be deterministic with respect to the environment actions, so that an exact mapping from the history of actions to simulation state exists. The DRL solver represents the AST policy as a feed-forward neural network, where the simulation state is used as input. It would be advantageous to use the history of actions as input, but the solver architecture is not optimal for such a representation. Instead, an architecture designed for sequential data would be preferable.
The second limitation of the DRL solver is that it can only be trained for a single initial condition. Consider again the pedestrian at the crosswalk. Simulating this scenario with the car starting close to the crosswalk could have a significantly different outcome than if the car started far away. When validating an autonomous vehicle, we are interested in both of these instantiations, as well as the range in between. All of the possible initial conditions comprise a class of scenarios, and different instantiations could lead to different failure modes. An ideal validation method would therefore be able to cover the entire initial condition space. Unfortunately, we currently would have to discretize the space of initial conditions and train a new DRL solver for each bin, as shown in Figure 3(a). Even on small problems, the space of initial conditions can be high-dimensional, making a discretized representation intractable. Instead, we would like to train a single DRL solver that can take any initial condition in the continuum of our defined space.
In order to effectively use AST while validating complex autonomous vehicles, this paper extends the DRL solver by changing the policy architecture to a recurrent neural network (RNN), which has two advantages:
1. RNNs are designed for sequential tasks, so the simulation state is no longer needed as input. The RNN takes the previous action as input, and uses it to internally maintain a hidden state. This is analogous to using the history of actions as the current state.
2. Specific types of RNNs have shown the ability to learn temporal patterns. This is essential when working with different initial conditions, since trajectories could end up in similar states that have different expected values due to reaching the state at different times. Therefore, we are able to add the initial state as input to the network, and the network will learn to generalize across the space of initial conditions.
The new architecture therefore addresses the two major limitations of the old DRL solver. We will demonstrate the improvements by roughly discretizing the space of initial conditions and comparing the performance of the new architecture against a MCTS solver and the current DRL solver. Generalization will then be demonstrated by letting the new architecture sample from the space of initial conditions at train time. Improving the DRL solver will make autonomous vehicle validation in simulation more tractable, leading to vehicles that are more reliable and robust.
II BACKGROUND
Adaptive stress testing formulates the problem of finding the most-likely failure as an MDP. Two methods are used to solve this problem: MCTS and DRL.
II-A Markov Decision Process
Markov decision processes (MDPs) are a framework for formulating sequential decision making problems [11]. In an MDP, an agent chooses an action $a$ in state $s$. The agent receives a reward $r$ according to the reward function $R(s, a)$. The agent transitions to the next state $s'$ according to the transition probability function $P(s' \mid s, a)$. According to the Markov assumption, the transition only depends on the current state and action. Neither the transition nor reward functions need to be deterministic. In some cases, the transition or reward functions may not be known.
The solution to an MDP is represented by a policy $\pi$ that specifies the action to take in a given state. An optimal action maximizes the expected utility, which can be found recursively:
$$U^*(s) = \max_a \left( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, U^*(s') \right)$$
where $\gamma$ is the discount factor that controls the weight of future rewards. Large MDPs may need to be solved approximately. Two examples of reinforcement learning techniques for finding the approximate solution to an MDP are Monte Carlo tree search and deep reinforcement learning.
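The recursive definition of the optimal utility can be made concrete with value iteration on a tiny MDP. The following is a minimal sketch, not part of the paper: the two-state MDP, the state and action names, and the function names `value_iteration` and `greedy_policy` are all hypothetical and chosen only for illustration.

```python
def backup(s, a, P, R, U, gamma):
    """One-step lookahead: R(s, a) + gamma * sum_s' P(s' | s, a) * U(s')."""
    return R[(s, a)] + gamma * sum(p * U[s2] for s2, p in P[(s, a)])


def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-9):
    """Repeat the Bellman backup until the utilities stop changing."""
    U = {s: 0.0 for s in states}
    while True:
        new_U = {s: max(backup(s, a, P, R, U, gamma) for a in actions) for s in states}
        delta = max(abs(new_U[s] - U[s]) for s in states)
        U = new_U
        if delta < tol:
            return U


def greedy_policy(states, actions, P, R, U, gamma=0.9):
    """Pick the action with the best one-step lookahead in every state."""
    return {s: max(actions, key=lambda a: backup(s, a, P, R, U, gamma)) for s in states}


# Hypothetical toy MDP: "fail" is absorbing, "push" risks ending the episode.
states = ["safe", "fail"]
actions = ["wait", "push"]
P = {
    ("safe", "wait"): [("safe", 1.0)],
    ("safe", "push"): [("safe", 0.5), ("fail", 0.5)],
    ("fail", "wait"): [("fail", 1.0)],
    ("fail", "push"): [("fail", 1.0)],
}
R = {("safe", "wait"): 0.0, ("safe", "push"): 1.0, ("fail", "wait"): 0.0, ("fail", "push"): 0.0}

U = value_iteration(states, actions, P, R)
policy = greedy_policy(states, actions, P, R, U)
```

Exact backups like this are only feasible for small, known MDPs; the solvers discussed next approximate the same quantity by sampling.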
II-B Monte Carlo Tree Search
Monte Carlo tree search (MCTS) [12] is an online sampling-based reinforcement learning method that has performed well on large MDPs [13]. MCTS builds and maintains a tree where the nodes represent states or actions in the MDP. While executing from states in the tree, MCTS chooses the action that maximizes
$$Q(s, a) + c \sqrt{\frac{\log N(s)}{N(s, a)}}$$
where $Q(s, a)$ is the expected value of a state-action pair, $N(s)$ and $N(s, a)$ are the number of times a state and a state-action pair have been visited, respectively, and $c$ is a parameter that controls exploration. When a new state is encountered, $N(s, a)$ and $Q(s, a)$ are initialized for all of the available actions, and the state is added to the tree. Then, $Q(s, a)$ is updated by executing rollouts to a specified depth and returning the expected value. The algorithm is run until a stopping criterion is met. This paper uses a variation of MCTS with double progressive widening [14] to limit the branching of the tree, which leads to better performance on problems with large or continuous action spaces.
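The action-selection rule can be sketched in a few lines. This is an illustrative sketch of the standard upper-confidence-bound selection step only, assuming a discrete set of candidate actions at a node; it omits tree expansion, rollouts, and double progressive widening, and the name `ucb_select` and its arguments are hypothetical.

```python
import math


def ucb_select(actions, Q, N_s, N_sa, c=1.0):
    """Return the action maximizing Q(s, a) + c * sqrt(log N(s) / N(s, a)).

    Q and N_sa map each action to its value estimate and visit count at the
    current state; N_s is the visit count of the state itself.
    """
    def score(a):
        if N_sa[a] == 0:
            return math.inf  # untried actions are explored first
        return Q[a] + c * math.sqrt(math.log(N_s) / N_sa[a])

    return max(actions, key=score)


# Action "b" has a slightly lower value but has barely been tried.
Q = {"a": 1.0, "b": 0.9}
N_sa = {"a": 100, "b": 1}
explore = ucb_select(["a", "b"], Q, N_s=101, N_sa=N_sa, c=1.0)
exploit = ucb_select(["a", "b"], Q, N_s=101, N_sa=N_sa, c=0.0)
```

With `c=1.0` the exploration bonus favors the rarely visited action, while `c=0.0` reduces the rule to a purely greedy choice.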
II-C Deep Reinforcement Learning
Deep reinforcement learning (DRL) represents a policy as a neural network (NN) parameterized by $\theta$. Recurrent neural networks (RNNs) are a family of NNs designed to handle sequential inputs. RNNs maintain a hidden state, which propagates information through time, and factor historical information into their output through that hidden state. The network maintains a set of weights for both the hidden state and the output. While RNNs traditionally can be difficult to train due to the vanishing and exploding gradient problems, long short-term memory (LSTM) layers address this by introducing gated self-loops which enforce constant error flow [15].
Trust Region Policy Optimization (TRPO) is a gradient-based method for improving the policy [16]. TRPO generally gives monotonic increases in policy performance by constraining the KL divergence between policy steps. The policy gradient can be obtained using generalized advantage estimation (GAE) [17], a method for estimating policy gradients from batches of simulation trajectories.
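As a rough illustration of generalized advantage estimation, the sketch below computes advantages for a single trajectory from its rewards and a set of value estimates. It follows the standard GAE recursion; the function name, the default values of `gamma` and `lam`, and the toy inputs are assumptions for illustration and are not the settings used in the paper.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards has T entries; values has T + 1 entries, where the last entry is
    the value estimate of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of the residuals that follow step t.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# With zero value estimates and lam = 1, the advantages are discounted returns.
adv = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
```

Setting `lam` to 0 keeps only the one-step residual (low variance, more bias), while `lam` equal to 1 recovers the full discounted return minus the baseline.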
III Methodology
When validating autonomous systems, stress testing is the process of eliciting failures to evaluate the robustness of the system. This section outlines the process involved in using AST to find the most-likely failure, based on the material presented above. We also explain the changes made to the DRL solver to improve the performance and add a new capability.
III-A Adaptive Stress Testing
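Adaptive stress testing formulates finding the most-likely failure of a system as a sequential decision-making problem. Given a simulator $\mathcal{S}$ and a subset of the state space $E$ where the events of interest (e.g. a collision) occur, we want to find the most-likely trajectory $s_0, \ldots, s_t$ that ends in our subset $E$. Given $(\mathcal{S}, E)$, the formal problem is
$$\underset{a_0, \ldots, a_t}{\text{maximize}} \quad P(s_0, a_0, \ldots, s_t, a_t) \quad \text{subject to} \quad s_t \in E$$
where $P(s_0, a_0, \ldots, s_t, a_t)$ is the probability of a trajectory in simulator $\mathcal{S}$ and $s_t = f(a_t, s_{t-1})$.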
AST requires the following three functions to interact with the simulator:
- Initialize: Resets the simulator $\mathcal{S}$ to a given initial state $s_0$.
- Step: Steps the simulation in time by drawing the next state $s'$ after taking action $a$. The function returns the probability of the transition and an indicator of whether $s'$ is in the failure set $E$ or not.
- IsTerminal: Returns true if the current state of the simulation is in $E$, or if the horizon of the simulation has been reached.
Unlike previous formulations, the Initialize function now accepts an initial state. The purpose of this change is to output a policy that can generalize to different scenario instantiations.
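The three-function interface described above can be sketched as an abstract class with a toy implementation. Everything here is hypothetical and only meant to show the shape of the interface: the class names, the one-dimensional "gap" state, the standard normal disturbance model, and the `rollout` helper are assumptions, not the simulator used in the paper.

```python
import math
from abc import ABC, abstractmethod
from typing import Callable, Tuple


class ASTSimulator(ABC):
    """The three functions AST needs from a black-box simulator."""

    @abstractmethod
    def initialize(self, s0: float) -> None:
        """Reset the simulator to the given initial state."""

    @abstractmethod
    def step(self, action: float) -> Tuple[float, bool]:
        """Advance one time-step; return (transition likelihood, failure flag)."""

    @abstractmethod
    def is_terminal(self) -> bool:
        """True once a failure occurred or the horizon was reached."""


class ToyCrossing(ASTSimulator):
    """Hypothetical 1-D toy: the state is the gap between pedestrian and car."""

    def __init__(self, horizon: int = 50) -> None:
        self.horizon = horizon
        self.gap, self.t, self.failed = 0.0, 0, False

    def initialize(self, s0: float) -> None:
        self.gap, self.t, self.failed = s0, 0, False

    def step(self, action: float) -> Tuple[float, bool]:
        # The action is a disturbance, assumed here to be standard normal.
        self.gap -= action
        self.t += 1
        self.failed = self.gap <= 0.0
        likelihood = math.exp(-0.5 * action * action) / math.sqrt(2.0 * math.pi)
        return likelihood, self.failed

    def is_terminal(self) -> bool:
        return self.failed or self.t >= self.horizon


def rollout(sim: ASTSimulator, s0: float, policy: Callable[[int], float]):
    """Run one episode; return (log-likelihood, failure flag, number of steps)."""
    sim.initialize(s0)
    log_likelihood, failed, steps = 0.0, False, 0
    while not sim.is_terminal():
        likelihood, failed = sim.step(policy(steps))
        log_likelihood += math.log(likelihood)
        steps += 1
    return log_likelihood, failed, steps


pushy = rollout(ToyCrossing(), 3.0, lambda t: 1.0)  # constant disturbance
calm = rollout(ToyCrossing(), 3.0, lambda t: 0.0)   # most-likely disturbance
```

Because the solver only calls these three methods, the simulator's internal state never has to be exposed, which is the black-box property the text relies on.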
III-B Recurrent Deep Reinforcement Learning Solver
We previously added a new deep reinforcement learning (DRL) solver to AST [10]. The solver is interchangeable with the commonly-used MCTS solver. The previous implementation required the simulation state as input, which was an undesirable relaxation of the black-box simulator assumption. Treating the simulation as a black box allows easier implementation for complicated or third-party simulators, for which the simulator’s internal state may not be accessible. As such, we have redesigned the DRL solver to meet the black-box assumption.
The AST agent must control all stochasticity in the simulation, therefore transitions are deterministic with respect to the AST agent’s actions. Because the SUT updates are deterministic, the history of actions and the initial state are sufficient to represent the current state. Consequently, the simulator is allowed to be non-markovian. Replacing the simulation state with the history of actions also fulfills the black-box assumption, because the simulation state is no longer needed as input. The previous DRL solver, referred to hereafter as the MLPDRL solver, used a Gaussian multi-layer perceptron (MLP) architecture, which does not work well with this state representation.
Instead of a Gaussian MLP, the policy is now represented by a recurrent neural network (RNN), using long short-term memory (LSTM) layers. The network architecture is shown in Figure 4. An RNN is able to train on a sequence of inputs while maintaining a hidden state, which is analogous to using a history of previous actions as the current state. The output of the policy is a mean action vector for a multivariate Gaussian distribution. The diagonal covariance matrix is independent of state and trained separately [16]. The only input to the network is the previous action, hence $x_t = a_{t-1}$. While the simulation state is no longer needed, this solver can only be run from a single initial condition, therefore we will refer to it as the discrete recurrent deep reinforcement learning (DRDRL) solver.
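To illustrate the policy output described above, the sketch below samples an action from a Gaussian with a diagonal covariance and evaluates its log-probability. It assumes the state-independent covariance is stored as a vector of log standard deviations, which is a common convention but an assumption here; the function names are hypothetical and the mean vector stands in for the output of the recurrent network.

```python
import math
import random


def sample_action(mean, log_std, rng):
    """Draw each action dimension independently from N(mean_i, exp(log_std_i)^2)."""
    return [rng.gauss(m, math.exp(ls)) for m, ls in zip(mean, log_std)]


def log_prob(action, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        z = (a - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


rng = random.Random(0)
mean = [0.0, 0.0]        # stand-in for the network's mean action output
log_std = [0.0, 0.0]     # state-independent parameters (unit variance here)
action = sample_action(mean, log_std, rng)
density_at_mean = log_prob(mean, mean, log_std)
```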
III-C Continuous Scenario Generalization
Previous work with AST provided a trajectory from a discrete initial condition. As discussed earlier, engineers are often concerned with a scenario that starts from a space of initial conditions. Even using a coarse grid discretization, the number of possible initial conditions is still exponential in the dimension of the initial condition space. Despite the increased efficiency of AST, running from a large number of initial conditions for a single class of scenarios would take an impractical amount of compute time. Instead, we would like to have a single solver that can find a likely failure trajectory from any initial condition.
By modifying the Initialize function to accept the initial state, we hypothesize that AST can learn a policy that generalizes to the entire space of initial conditions. Our hypothesis arises because autonomous vehicle test cases are run on models of the real world. Consequently, slight deviations in position and noise should produce similar policies. Therefore, training AST across the space of initial conditions should be far more sample efficient than running AST for individual instantiations. The architecture is therefore modified so the input at each time-step is the concatenation of the previous action and the initial condition, hence $x_t = [a_{t-1}, s_0]$. During training, each rollout starts from a randomly sampled initial condition. We will refer to this architecture as the generalized recurrent deep reinforcement learning (GRDRL) solver.
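The difference between the two recurrent solvers comes down to what is fed to the network at each time-step. A minimal sketch of that input construction follows, assuming a 6-dimensional action and a 5-dimensional initial condition as in the experiments; the supports listed in the code are placeholder values invented for illustration, and the function names are hypothetical.

```python
import random


def sample_initial_condition(supports, rng):
    """Draw one initial condition uniformly from per-dimension (min, max) supports."""
    return [rng.uniform(lo, hi) for lo, hi in supports]


def policy_input(prev_action, s0=None):
    """Build the network input for one time-step.

    With s0 omitted the input is just the previous action (the discrete
    recurrent solver); with s0 given, the initial condition is appended to
    the previous action (the generalized recurrent solver).
    """
    x = list(prev_action)
    if s0 is not None:
        x.extend(s0)
    return x


# Placeholder supports for a 5-dimensional initial condition (not the paper's values).
supports = [(-1.0, 1.0), (-4.0, -2.0), (-40.0, -30.0), (0.5, 1.5), (9.0, 12.0)]
rng = random.Random(0)
s0 = sample_initial_condition(supports, rng)   # resampled at the start of each rollout
x_discrete = policy_input([0.0] * 6)
x_general = policy_input([0.0] * 6, s0)
```

Since `s0` is repeated at every time-step, the network can condition each action on the scenario instantiation without any access to the simulator's internal state.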
IV Experiments
This section outlines the problem used in simulation to test AST, the hyper-parameters of the DRL solver, and the reward structure. For bench-marking purposes, we follow the experiment setup—simulation, pedestrian models, and SUT model—proposed in our previous work [10]. The problem has a 5-dimensional state-space and a 6-dimensional action space, and is run for up to 50 time-steps.
IV-A Problem Formulation
Our experiment simulates a common neighborhood road driving scenario, shown in Figure 1. The road has one lane in each direction. A pedestrian crosses at a marked crosswalk, from south to north. The $x$ origin is at the center of the crosswalk, and the $y$ origin is where the crosswalk meets the side of the road. The speed limit is , which is .
The inputs to the GRDRL solver include the initial state, which consists of
- the initial $x$, $y$ location of the pedestrian,
- the initial position of the car,
- the initial velocity of the pedestrian, and
- the initial velocity of the car.
Initial conditions are drawn from a continuous uniform distribution, with the supports shown in Table I. Trajectory rollouts are instantiated by randomly sampling an initial condition from the parameter ranges.
IV-B Modified Reward Function
AST penalizes each step by the likelihood of the environment actions, as shown in Equation 3. Unlikely actions have a higher cost, so the solver is incentivized to take likelier actions, and therefore find likelier failures. The Mahalanobis distance [18] is used as a proxy for the likelihood of an action. The Mahalanobis distance is a measure of distance from the mean generalized for multivariate continuous distributions. The penalty for failing to find a collision is controlled by $\alpha$ and $\beta$. The penalty at the end of a no-collision case is scaled by the distance (dist) between the pedestrian and the vehicle. The penalty encourages the pedestrian to end early trials closer to the vehicle and leads to faster convergence. We use $\alpha = -1 \times 10^{5}$ and $\beta = -1 \times 10^{4}$. The reward function is modified from the previous version of AST [9] as follows:
[TABLE]
where $M(a, \mu_a \mid s)$ is the Mahalanobis distance between the action $a$ and the expected action $\mu_a$ given the covariance matrix $\Sigma$ in the current state $s$. The distance between the vehicle position $p_v$ and the closest pedestrian position $p_p$ is given by the function $\mathrm{dist}(p_v, p_p)$.
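The reward structure described in this section can be sketched as a three-case function. This is a hedged illustration rather than the paper's exact equation: it assumes a failure receives zero reward, that the end-of-trajectory penalty is the sum of a constant term and a distance-scaled term using the negative constants quoted in the text, and that the covariance is diagonal. The function and argument names are hypothetical.

```python
import math

ALPHA = -1e5  # constant quoted in the text
BETA = -1e4   # constant quoted in the text


def mahalanobis_diag(action, mean, variances):
    """Mahalanobis distance between action and mean for a diagonal covariance."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(action, mean, variances)))


def ast_reward(in_failure_set, at_horizon, action, mean, variances, dist):
    """Three-case reward (sign convention for the penalty is an assumption)."""
    if in_failure_set:
        return 0.0  # assumed: no further penalty once a failure is found
    if at_horizon:
        return ALPHA + BETA * dist  # no collision: penalty grows with miss distance
    return -mahalanobis_diag(action, mean, variances)  # unlikely actions cost more


step_reward = ast_reward(False, False, [3.0, 4.0], [0.0, 0.0], [1.0, 1.0], dist=10.0)
miss_reward = ast_reward(False, True, [0.0, 0.0], [0.0, 0.0], [1.0, 1.0], dist=2.0)
```

Under these assumptions, a trajectory that ends closer to the vehicle is penalized less at the horizon, which matches the stated purpose of the distance term.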
IV-C Solvers
The DRL solver uses the new recurrent architecture shown in Figure 4. The hidden layer size is 64. Training was done with a batch size of time-steps. The maximum trajectory length is , hence each batch has trajectories. The optimizer used a step size of , and a discount factor of . The DRL approach was implemented using rllab [19].
V RESULTS
This section shows the performance of the new solvers on our running example. First, the solvers' ability to train on the problem is compared. Both solvers are then compared to baselines to show their improvement.
V-A Overall Performance
The goal of AST is to understand failure modes by returning the most-likely failure. An advantage of the new architecture is being able to search for the most-likely failure from a space of initial conditions while training a single network. Figure 5 demonstrates these benefits by showing the cumulative maximum reward found by the DRDRL and GRDRL solvers at each iteration. There are two estimates shown for the DRDRL architecture:
- Sequential: Each discrete AST solver is run sequentially. The naive approach serves as a lower bound on the performance of the discrete architecture.
- Batch: The AST solvers are updated as a batch. Each batch is assumed to still take 32 iterations, but the best reward of the best solver is known after each update.
The generalized architecture outperforms the discrete architecture at every iteration. The generalized version finds a collision sooner and converges to a solution after about 100 iterations, whereas the discrete architecture is still improving after 500 iterations. Furthermore, the generalized version is able to find a trajectory that has a net Mahalanobis distance of . In contrast, the discrete version’s most-likely solution was . Over the entire space of initial conditions, running the generalized architecture is more accurate in far fewer iterations than running the discrete architecture at discrete points.
V-B Comparison to Baselines
Table II shows the aggregate results of the new architectures as well as two baselines: the old multi-layer perceptron architecture and a Monte Carlo tree search (MCTS) solver. The data was generated by dividing the 5-dimensional initial condition space into 2 bins per dimension, which resulted in 32 bins. Such a rough discretization is unsafe, but the number of bins is equal to $n^5$, where $n$ is the number of bins per dimension. Using 3 bins per dimension, which is hardly much safer, would result in training 243 instances of AST. Even on a toy problem, running AST for a safe number of discrete points is intractable. However, to demonstrate the performance benefits of the GRDRL architecture, we ran the MCTS, MLPDRL, and DRDRL solvers at the center-point of each of the 32 bins. The GRDRL solver was trained on the entire space of initial conditions, and evaluated in two ways: 1) by executing the GRDRL solver’s policy from the same 32 center-points that the other solvers were tested at, referred to as point evaluation, and 2) by sampling from each bin in the initial condition space and keeping the best GRDRL solution, referred to as bin evaluation. The performance of the DRDRL and GRDRL solvers is discussed below.
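The growth in the number of discrete solver instances is easy to check numerically. The sketch below enumerates bin center-points for a 5-dimensional space; the unit-interval supports and the function name `bin_centers` are placeholders chosen for illustration.

```python
import itertools


def bin_centers(supports, bins_per_dim):
    """Center-point of every bin in a regular grid over the given supports."""
    per_dim = []
    for lo, hi in supports:
        width = (hi - lo) / bins_per_dim
        per_dim.append([lo + (i + 0.5) * width for i in range(bins_per_dim)])
    # One grid point per combination of per-dimension centers.
    return list(itertools.product(*per_dim))


supports = [(0.0, 1.0)] * 5            # placeholder 5-dimensional space
coarse = bin_centers(supports, 2)      # 2 bins per dimension
finer = bin_centers(supports, 3)       # 3 bins per dimension
print(len(coarse), len(finer))
```

Each center-point would need its own discrete solver, so the count grows as the number of bins per dimension raised to the fifth power.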
V-B1 Discrete Recurrent Deep Reinforcement Learning Solver
The DRDRL solver performs similarly to both baselines. The DRDRL solver’s average collision reward over the 32 bins was slightly worse than that of the MLPDRL solver and significantly worse than that of the MCTS solver. However, the DRDRL solver found crashes in many more bins than the MCTS solver, and found the most-likely collision of the three solvers. Finding the most-likely collision was the primary goal of the three solvers, therefore the DRDRL solver performed better than both baselines, despite not having access to the simulation state like the MLPDRL solver.
V-B2 Generalized Recurrent Deep Reinforcement Learning Solver
The GRDRL solver far outperforms both baselines, as well as the DRDRL solver. When evaluating over the entire bin, the GRDRL solver found collisions in every single bin, and had by far the best average and maximum collision rewards. The maximum reward in particular demonstrates both the strength and necessity of the new solver architecture. The most-likely collision was not at one of the 32 points tested, hence a discretization approach does not find the most-likely trajectory. Surprisingly, though, the GRDRL solver also outperforms the other solvers at the 32 center-points. Despite not training from the center-points specifically, the GRDRL solver has a better average and maximum collision reward. The only degradation in performance was in collision percentage, although the GRDRL solver still outperforms the MCTS solver.
VI Conclusion
This paper presents a new architecture for AST to improve the validation of autonomous vehicles. The new solver treats the simulator as a black box and generalizes across a space of initial conditions. The new architecture is able to converge to a more-likely failure scenario in fewer iterations than running the discrete architecture. Future work will demonstrate performance in a high-fidelity simulator. The advancements presented in this paper show that AST is a promising approach for validating autonomous systems.
References
- [1] V. Agaram, F. Barickman, F. Fahrenkrog, E. Griffor, I. Muharemovic, H. Peng, J. Salinger, S. Shladover, and W. Shogren, “Validation and Verification of Automated Road Vehicles,” in Road Vehicle Automation 3, G. Meyer and S. Beiker, Eds. Springer, 2016, pp. 201–210.
- [2] P. Koopman and M. Wagner, “Challenges in autonomous vehicle testing and validation,” SAE International Journal of Transportation Safety, vol. 4, no. 1, pp. 15–24, 2016.
- [3] N. Kalra and S. M. Paddock, “Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?” Transportation Research Part A: Policy and Practice, vol. 94, pp. 182–193, 2016.
- [4] P. Koopman, “The heavy tail safety ceiling,” in Automated and Connected Vehicle Systems Testing Symposium, 2018.
- [5] G. E. Mullins, P. G. Stankiewicz, R. C. Hawthorne, and S. K. Gupta, “Adaptive generation of challenging scenarios for testing and evaluation of autonomous vehicles,” Journal of Systems and Software, vol. 137, pp. 197–215, 2018.
- [6] M. O’Kelly, A. Sinha, H. Namkoong, R. Tedrake, and J. C. Duchi, “Scalable end-to-end autonomous vehicle testing via rare-event simulation,” in Advances in Neural Information Processing Systems, 2018, pp. 9827–9838.
- [7] C. E. Tuncali, G. Fainekos, H. Ito, and J. Kapinski, “Simulation-based adversarial test generation for autonomous vehicles with machine learning components,” in IEEE Intelligent Vehicles Symposium, 2018, pp. 1555–1562.
- [8] L. Li, W.-L. Huang, Y. Liu, N.-N. Zheng, and F.-Y. Wang, “Intelligence testing for autonomous vehicles: A new approach,” IEEE Transactions on Intelligent Vehicles, vol. 1, no. 2, pp. 158–166, 2016.
