Learning to Predict Ego-Vehicle Poses for Sampling-Based Nonholonomic   Motion Planning

Holger Banzhaf; Paul Sanzenbacher; Ulrich Baumann; J. Marius Z\"ollner

arXiv:1812.01127·cs.RO·February 4, 2019

Learning to Predict Ego-Vehicle Poses for Sampling-Based Nonholonomic Motion Planning

Holger Banzhaf, Paul Sanzenbacher, Ulrich Baumann, J. Marius Z\"ollner

PDF

Open Access

TL;DR

This paper presents a data-driven CNN approach to predict ego-vehicle poses, significantly improving sampling efficiency and convergence speed in nonholonomic motion planning for automated vehicles.

Contribution

It introduces a novel CNN-based method for generating problem-specific sampling distributions, enhancing the efficiency of sampling-based motion planning in complex driving scenarios.

Findings

01

CNN predicts vehicle poses more accurately than uniform sampling and A*-based methods.

02

Combining CNN with Bidirectional RRT* reduces computation time by up to tenfold.

03

Achieves 100% success rate in tested scenarios.

Abstract

Sampling-based motion planning is an effective tool to compute safe trajectories for automated vehicles in complex environments. However, a fast convergence to the optimal solution can only be ensured with the use of problem-specific sampling distributions. Due to the large variety of driving situations within the context of automated driving, it is very challenging to manually design such distributions. This paper introduces therefore a data-driven approach utilizing a deep convolutional neural network (CNN): Given the current driving situation, future ego-vehicle poses can be directly generated from the output of the CNN allowing to guide the motion planner efficiently towards the optimal solution. A benchmark highlights that the CNN predicts future vehicle poses with a higher accuracy compared to uniform sampling and a state-of-the-art A*-based approach. Combining this CNN-guided…

Tables5

Table 1. Table I : Quantitative comparison of the hyperparameter optimization on the basis of 200 200 200 sampled vehicle poses carried out on the validation dataset. The displayed values are given with mean and standard deviation.

$l r$	$γ_{CE}$	$γ_{MSE}$	$G$ [ $%$ ]	$D$ [ $-$ ]
$10^{- 3}$	25	25	20.7 $\pm$ 18.9	0.64 $\pm$ 0.26
$10^{- 4}$			16.9 $\pm$ 18.6	0.35 $\pm$ 0.33
$10^{- 5}$			10.1 $\pm$ 12.0	0.33 $\pm$ 0.35
$10^{- 5}$	10	10	12.6 $\pm$ 14.9	0.26 $\pm$ 0.26
	10	25	12.7 $\pm$ 15.3	0.26 $\pm$ 0.28
	10	100	12.5 $\pm$ 14.9	0.25 $\pm$ 0.25
	25	10	11.2 $\pm$ 13.4	0.31 $\pm$ 0.35
	25	25	10.1 $\pm$ 12.0	0.33 $\pm$ 0.35
	25	100	10.8 $\pm$ 12.9	0.33 $\pm$ 0.33
	100	10	09.8 $\pm$ 10.6	0.45 $\pm$ 0.40
	100	25	10.3 $\pm$ 11.6	0.46 $\pm$ 0.45
	100	100	10.3 $\pm$ 11.5	0.42 $\pm$ 0.35

Table 2. Table II : Parameters of the Orientation-Aware Space Exploration.

parameter	value
minimum radius	$0.2 m$
maximum radius	$5.0 m$
minimum clearance	$1.041 m$
neighbors	$32$
maximum curvature	$0.1982 m^{- 1}$
computation timeout	$1.0 s$

Table 3. Table III : Benchmark of uniform sampling, the OSE approach, and the CNN prediction on the test dataset. The metrics G 𝐺 G and D 𝐷 D as well as the computation time are evaluated on 200 200 200 sampled vehicle poses and are given with mean and standard deviation.

	scenarios	$G$ [ $%$ ]	$D$ [ $-$ ]	time [ $ms$ ]
uniform	all	13.5 $\pm$ 6.400.0	7.84 $\pm$ 0.44	0.1 $\pm$ 0.0
OSE	all	15.2 $\pm$ 17.4	0.57 $\pm$ 0.30	39.7 $\pm$ 43.1
CNN¹	all	10.4 $\pm$ 12.5	0.34 $\pm$ 0.39	41.7 $\pm$ 14.0
CNN²	all	11.8 $\pm$ 14.1	0.35 $\pm$ 0.40	39.7 $\pm$ 14.5
CNN³	all	11.9 $\pm$ 14.3	0.34 $\pm$ 0.37	37.3 $\pm$ 13.7
CNN⁴	all	17.8 $\pm$ 19.6	0.44 $\pm$ 0.35	37.4 $\pm$ 14.0
CNN⁵	all	11.4 $\pm$ 14.0	0.32 $\pm$ 0.33	35.7 $\pm$ 14.5
CNN⁶	all	12.0 $\pm$ 14.8	0.32 $\pm$ 0.33	38.8 $\pm$ 14.1
CNN⁷	all	32.6 $\pm$ 22.4	0.82 $\pm$ 0.55	38.4 $\pm$ 13.4
CNN¹	arena	14.4 $\pm$ 16.2	0.50 $\pm$ 0.47	42.5 $\pm$ 25.2
CNN¹	parking	9.1 $\pm$ 9.6	0.23 $\pm$ 0.15	40.5 $\pm$ 25.1

Table 4. Table IV : Parameters of the ego-vehicle.

parameter	value
length	$4.926 m$
width	$2.086 m$
wheel base	$2.912 m$
maximum curvature	$0.1982 m^{- 1}$
maximum curvature rate	$0.1868 m^{- 2}$

Table 5. Table V : Motion planning results after 100 100 100 runs of the same experiment in the three described scenarios. The table displays the comp. time for 100 100 100 pose predictions, the time-to-first-solution TTFS, the number of vertices in the tree, the number of cusps, the path length, the cost of the first solution cost TTFS subscript cost TTFS \text{cost}_{\text{TTFS}} , the cost after 3 s times 3 second 3\text{\,}\mathrm{s} of optimization cost TTFS+3s subscript cost TTFS+3s \text{cost}_{\text{TTFS+3s}} , and the success rate. Values given with mean and standard deviation are highlighted with a plus-minus sign.

scenario	heuristic	steer. fct.	pred. [ $ms$ ]	$TTFS [s]$	#vertices $[-]$	#cusps $[-]$	length $[m]$	${cost}_{TTFS}$ $[-]$	${cost}_{TTFS+3s}$ $[-]$	success $[%]$
I	-	HC⁰⁰-RS	-	0.57 $\pm$ 0.41	79.2 $\pm$ 14.3	4.0 $\pm$ 1.2	35.2 $\pm$ 4.8	162.0 $\pm$ 48.3	124.1 $\pm$ 28.2	100
	OSE		127.6	1.54 $\pm$ 2.15	139.2 $\pm$ 27.6	4.6 $\pm$ 1.6	38.7 $\pm$ 7.0	160.7 $\pm$ 50.4	145.1 $\pm$ 42.4	89
	CNN		82.8	0.13 $\pm$ 0.07	140.6 $\pm$ 7.5	2.2 $\pm$ 0.8	28.1 $\pm$ 3.9	134.4 $\pm$ 37.2	93.4 $\pm$ 18.9	100
II	-	CC⁰⁰-Dub.	-	3.51 $\pm$ 2.31	127.0 $\pm$ 12.6	0.0 $\pm$ 0.0	25.7 $\pm$ 0.2	63.4 $\pm$ 16.4	60.6 $\pm$ 15.9	82
	OSE		17.4	0.12 $\pm$ 0.14	187.2 $\pm$ 12.0	0.0 $\pm$ 0.0	25.7 $\pm$ 0.1	50.1 $\pm$ 11.2	37.4 $\pm$ 2.9	100
	CNN		82.4	0.11 $\pm$ 0.03	221.2 $\pm$ 8.0	0.0 $\pm$ 0.0	25.7 $\pm$ 0.1	43.7 $\pm$ 9.4	33.6 $\pm$ 1.9	100
III	-	HC⁰⁰-RS	-	2.92 $\pm$ 2.53	96.1 $\pm$ 15.7	3.6 $\pm$ 1.5	20.7 $\pm$ 14.3	152.4 $\pm$ 51.0	148.2 $\pm$ 51.9	91
	OSE		19.0	0.02 $\pm$ 0.03	154.6 $\pm$ 6.0	3.0 $\pm$ 1.0	12.8 $\pm$ 1.5	143.6 $\pm$ 21.0	119.2 $\pm$ 16.3	100
	CNN		80.9	0.09 $\pm$ 0.02	157.9 $\pm$ 4.6	2.0 $\pm$ 0.0	13.0 $\pm$ 0.6	100.4 $\pm$ 23.5	84.0 $\pm$ 3.7	100

Equations13

L = c \in C \sum (f_{CE} (c) L_{CE} (c) + f_{MSE} (c) L_{MSE} (c)) + w \in W \sum \frac{λ w ^{2}}{2},

L = c \in C \sum (f_{CE} (c) L_{CE} (c) + f_{MSE} (c) L_{MSE} (c)) + w \in W \sum \frac{λ w ^{2}}{2},

f_{CE} (c)

f_{CE} (c)

f_{MSE} (c)

D (X, X_{pred}) = \frac{\sum _{i = 1}^{N} min _{x_{t} \in X} d ( x _{t} , x _{pred}^{[i]} )}{N},

D (X, X_{pred}) = \frac{\sum _{i = 1}^{N} min _{x_{t} \in X} d ( x _{t} , x _{pred}^{[i]} )}{N},

d (x_{t}, x_{pred}^{[i]}) = w_{pos} x_{pred}^{[i]} - x_{t}_{2} + w_{θ} θ_{pred}^{[i]} - θ_{t},

d (x_{t}, x_{pred}^{[i]}) = w_{pos} x_{pred}^{[i]} - x_{t}_{2} + w_{θ} θ_{pred}^{[i]} - θ_{t},

x_{ref}^{[i]}

x_{ref}^{[i]}

G (X, X_{ref}) = \frac{max _{i = 1 : N - 1} s _{ref}^{[i + 1]} - s _{ref}^{[i]}}{s _{T}},

G (X, X_{ref}) = \frac{max _{i = 1 : N - 1} s _{ref}^{[i + 1]} - s _{ref}^{[i]}}{s _{T}},

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsRobotic Path Planning Algorithms · Autonomous Vehicle Technology and Safety · Robot Manipulation and Learning

Full text

Learning to Predict Ego-Vehicle Poses for Sampling-Based Nonholonomic Motion Planning

Holger Banzhaf1, Paul Sanzenbacher2, Ulrich Baumann2, J. Marius Zöllner3 1Holger Banzhaf is with Robert Bosch GmbH, Corporate Research, Automated Driving, Renningen, Germany.2Paul Sanzenbacher and Ulrich Baumann were with Robert Bosch GmbH, Corporate Research, Automated Driving, Renningen, Germany.3J. Marius Zöllner is with FZI Research Center for Information Technology, Karlsruhe, Germany.

Abstract

Sampling-based motion planning is an effective tool to compute safe trajectories for automated vehicles in complex environments. However, a fast convergence to the optimal solution can only be ensured with the use of problem-specific sampling distributions. Due to the large variety of driving situations within the context of automated driving, it is very challenging to manually design such distributions. This paper introduces therefore a data-driven approach utilizing a deep convolutional neural network (CNN): Given the current driving situation, future ego-vehicle poses can be directly generated from the output of the CNN allowing to guide the motion planner efficiently towards the optimal solution. A benchmark highlights that the CNN predicts future vehicle poses with a higher accuracy compared to uniform sampling and a state-of-the-art A*-based approach. Combining this CNN-guided sampling with the motion planner Bidirectional RRT* reduces the computation time by up to an order of magnitude and yields a faster convergence to a lower cost as well as a success rate of $100\text{\,}\mathrm{\char 37\relax}$ in the tested scenarios.

This work is an extended version of [1] and therefore partially copyrighted by IEEE: © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

I Introduction

Motion planning is one of the major pillars in the software architecture of automated vehicles. Its task is to compute a safe trajectory from start to goal while taking into account the vehicle’s constraints, the non-convex surrounding as well as the comfort requirements of passengers. In structured environments, such as highway driving, it is sufficient to solve the motion planning problem locally close to the lane centerline. Semi-structured environments or complex maneuvers, however, require a global solution as local approaches typically fail. Real-world examples of such scenarios are evasive maneuvers due to blocked roads (see Fig. 1), multi-point turns at dead ends, or various parking problems.

Despite the complexity of such situations, the real-time constraints of automated vehicles still require a fast convergence to a preferably optimal solution. In order to achieve this, a common approach in the literature is the design of hand-crafted heuristics that guide the motion planner towards the goal [2, 3]. However, finding a good heuristic that takes into account the nonholonomic constraints of the vehicle while adapting to the various driving situations is nearly as difficult as solving the motion planning problem itself.

An appealing solution for this dilemma is the combination of classic motion planning techniques [4] with machine learning (ML) algorithms that learn to guide the motion planner in a data-driven way. In such a tandem configuration, for instance, the ML algorithm generates potential solutions for the planning problem while the motion planner selects the best proposals and guarantees collision avoidance.

This paper introduces and evaluates such a combination, in which a neural network guides the sampling-based motion planner Bidirectional RRT* (BiRRT*) [5]. Sampling-based rather than search-based planning [2, 6] is considered here as the latter requires expert knowledge to discretize the state-action space. Especially in tight environments, an improper choice of the discretization resolution directly influences the completeness of the search-based planner. Sampling-based motion planners, on the other hand, require a problem-specific sampling distribution to speed up the planning process. Within this context, the main contributions are:

•

Prediction of future ego-vehicle poses given the current environment as well as the start and goal state using a deep convolutional neural network (CNN). The network is trained end-to-end using trajectories and observations recorded in a real-world driving simulator.

•

Comparison of the proposed method with two baselines: uniform sampling and an A*-based approach.

•

Evaluation of the different sampling distributions along with the generic motion planner BiRRT* in three challenging automated driving scenarios.

The remainder of this paper is organized as follows: Sec. II describes the related work, and Sec. III focuses on the neural network used for the prediction task. The performance of the CNN-guided motion planner is highlighted in Sec. IV, followed by a conclusion in Sec. V. A supplementary video can be found at https://youtu.be/FZPn3OHdQxk.

II Related Work

Recent advances in deep learning have opened up new possibilities to improve or even replace existing motion planning algorithms for robot navigation. For instance, impressive results have been achieved in the field of imitation learning, where a robot is trained to act based on expert demonstrations, such as end-to-end learning for self-driving cars. Here, deep neural networks have shown the capability to learn reactive policies given high-dimensional sensor inputs [7, 8]. However, these approaches cannot guarantee safety and might fail in situations where a global solution of the motion planning problem is required.

The shortcomings of pure ML-based solutions can be avoided by replacing only non-safety-relevant parts of existing motion planners with ML, and hence maintain crucial properties like safety and optimality. Depending on the planning algorithm, different approaches exist in the literature. Optimization-based planners, for example, often suffer from short planning horizons. This issue can be resolved with a learned cost shaping term that aligns the short-term horizon with the long-term goal as proposed in [9].

In contrast to that, search-based planners in continuous state-action space require an action sampling distribution for graph expansion and a cost-to-go heuristic for node selection. The former is addressed in [10], where a generative adversarial network [11] is trained to guide the search towards the goal. Minimizing search effort through learned cost-to-go heuristics is achieved in [12]. However, iteratively evaluating the heuristic during search limits the capacity of the neural network due to real-time constraints. Furthermore, optimality can only be guaranteed in a multi-heuristic setup.

Sampling-based motion planners, on the other hand, require problem-specific sampling distributions for fast convergence to the optimal solution. A promising research direction is to learn these distributions using imitation learning. For instance, [13] integrates end-to-end learning in a sampling-based motion planner. A conditional variational auto-encoder [14] is trained in [15] to learn a sampling distribution of arbitrary dimension. While this approach seems generally promising, scaling it up to high-dimensional environment models, such as occupancy grids, remains an unresolved challenge. Opposed to that, [16] uses a CNN to predict future positions of the robot given a start and goal position as well as the current environment. The CNN’s output is then used to bias the sampling of RRT* [17]. A similar idea is presented in this paper with the major difference that not only position, but entire pose predictions (position and orientation) are learned for nonholonomic motion planning.

Recent publications [18, 19, 20] have shown novel motion planning algorithms that conduct planning in a learned latent space rather than in the complex configuration space. The general idea is to simplify the planning problem by solving it in the lower-dimensional latent space. While still in an early development phase, the future of these approaches in safety-critical applications highly depends on the possibility to satisfy hard constraints like collision avoidance.

III Learning Ego-Vehicle Pose Predictions

This section introduces the model for the prediction of the ego-vehicle poses connecting a start with a goal state in an environmental aware fashion. Sections III-A, III-B and III-C highlight the data generation process, the model, and the sampling of vehicle poses from the model’s output. The training process, hyperparameter optimization, and evaluation are finally described in Sections III-D, III-E and III-F.

III-A Data Generation

Learning ego-vehicle predictions in a supervised fashion requires a diverse dataset consisting of the vehicle’s trajectory from start to goal and the corresponding observations of the environment. Such a dataset with a total number of $13\,418$ trajectories is recorded in a real-world simulator using Gazebo and ROS. The implemented data generation process is fully automated and parallelized with multiple simulation instances, and requires no time-consuming manual labeling of the recorded training data.

Each recording is generated in one of the eleven scenarios visualized in Fig. 2.

As this research is embedded in a project with a focus on low-speed maneuvering in tight environments, the designed scenarios focus on challenging setups that require the vehicle to act precisely in a highly non-convex workspace. Exemplary situations include blocked roads, dead ends, or different parking problems. Variation is introduced in each scenario by changing the start and goal pose of the ego-vehicle as well as the number and location of the static obstacles. For unstructured environments, this is implemented by randomly sampling the obstacles and the vehicle’s start and goal pose. For semi-structured environments, obstacles are randomly assigned to predefined locations, such as parking spots. The ego-vehicle’s start and goal pose are, in this case, randomly chosen within predefined locations corresponding to the entrance and exit of a scenario, the driveway, or a parking spot.

The following procedure is then applied to obtain a recording in the randomly configured instances of each scenario. First, the motion planner BiRRT* generates a curvature-continuous collision-free path from start to goal as detailed in Sec. IV. Here, BiRRT* is guided by the A*-based heuristic described in Sec. III-F, and its initial motion plan is optimized for $5\text{\,}\mathrm{s}$ . Next, a motion controller executes the computed solution resulting in a trajectory with $t=1,\ldots,T$ vehicle states $\mathbf{x}_{t}\in\mathcal{X}$ , where a state $\mathbf{x}_{t}=(x_{t},y_{t},\theta_{t},\kappa_{t},v_{t})^{\top}$ at time step $t$ is defined by the vehicle’s position $(x_{t},y_{t})$ , orientation $\theta_{t}$ , curvature $\kappa_{t}$ , and velocity $v_{t}$ . Note that throughout this paper, the term vehicle pose is only used to describe a lower-dimensional subset of the state including the vehicle’s position and orientation. Finally, the observations from a simulated LiDAR perception are fused in a two-dimensional $60\text{\times}60\text{\,}\mathrm{m}$ occupancy grid with a resolution of $10\text{\,}\mathrm{cm}$ and recorded along the trajectory. A visualization of such a recording can be found in Fig. 4 on the left.

It has to be noted that, similar to [12], motion planning for data generation is conducted with full environmental information in order to guarantee fast convergence to a cost-minimizing solution. In contrast to that, the recorded occupancy grid only fuses the current history of sensor observations resulting in unobserved areas due to occlusions. This forces the model in the learning phase to also resolve situations with partial observability that requires an intuition where possible solutions might lie.

III-B Model

The model is designed to generate a sampling distribution over future vehicle poses connecting a start and goal state given an occupancy grid with environmental observations. Previous publications [16, 21, 22] have shown that CNNs are well suited for processing high dimensional inputs and predicting multi-modal distributions over future ego-vehicle locations. For nonholonomic motion planning, however, it is essential to enrich such a spatial distribution with the information about the robot’s heading angle at all predicted positions. Therefore, it is proposed to jointly learn a sampling distribution over future vehicle positions as well as a mapping from position to heading angle for a complete representation of a vehicle pose.

This task is realized with the encoder-decoder architecture shown in Fig. 3. The illustrated CNN, which contains about $2.5$ million parameters, is based on the well-known SegNet [23] from semantic segmentation. The main contribution is not the architecture of the model itself, as it can be easily exchanged with any other state-of-the-art architecture, but the representation of the input and output layers.

The proposed network takes five grids with a resolution of $256\text{\times}256\text{\,}\mathrm{px}$ as an input. These grids encode the static obstacles, the unknown environment, the past path, and the start and goal state of the vehicle (see Fig. 4).

A simple and effective encoding of the vehicle state is achieved by a $7\text{\times}7\text{\,}\mathrm{px}$ square that describes three information: (1) its location in the grid marks the position of the vehicle, (2) its inner pixels encode the corresponding vehicle velocity, and (3) its outer pixels depict the respective heading angle. Note that the curvature of the start state is implicitly encoded in the past path as it can be seen in Fig. 3 at the top left.

The input of the CNN is first encoded and then decoded back to its original size using alternating blocks of convolutional layers with $3\text{\times}3$ kernels and a stride of $1$ , $2\text{\times}2$ max pooling layers, and $2\text{\times}2$ unpooling layers. All convolutional layers except the last one are followed by a batch normalization [24] and a ReLU activation [25].

The CNN outputs four grids of the same resolution as the input grids. The first two grids give the results of a cell-wise classification that is trained to predict whether a given cell belongs to the future path or not. An example of such a prediction is shown in Fig. 3 at the top right, where the intensity corresponds to the probability $p_{\text{path}}(c)$ of a cell $c$ being part of the future path. In contrast to that, the last two output grids contain the results of a cell-wise regression for the sine and cosine components of the robot’s heading angle. The decomposition into sine and cosine has three advantages: (1) a cell with zeros in both components represents an invalid orientation and can therefore be used to label cells without angle information (see Fig. 4), (2) computing the cell-wise norm of the predicted components can be interpreted as a confidence measure $p_{\theta}(c)$ indicating if an angle information is available at a respective cell $c$ (see Fig. 3 at the bottom right), and (3) the prediction of the heading angle can be treated as a regression task rather than a classification task, which yields a compact representation of the output and avoids an exponential growth of its dimension. However, the latter comes with a potential drawback of not being able to predict multi-modal, cell-wise heading angle predictions. The experiments have shown, however, that this is only an issue in rare corner cases as most of the scenarios do not require the vehicle to change its heading angle significantly while staying in the same region of the environment.

III-C Vehicle Pose Sampling

Generating $i=1,\ldots,N$ continuous vehicle poses $\mathbf{x}_{\text{pred}}^{[i]}$ from the output of the CNN is conducted in four steps. First, $N$ random cells $c_{\text{pred}}^{[i]}$ are sampled from the CNN’s path prediction $p_{\text{path}}(c)$ , which can be seen as a probability mass function (pmf) that describes the relative probability of a cell $c$ being part of the future path (see Fig. 3). Sampling from the pmf can be realized efficiently using the low-variance sampling algorithm in [26]. Its linear time complexity yields a fast computation even for large values of $N$ . Next, a continuous position prediction $(x_{\text{pred}}^{[i]},y_{\text{pred}}^{[i]})$ is obtained by sampling uniformly within the previously computed discrete cell $c_{\text{pred}}^{[i]}$ . The heading angle prediction $\theta_{\text{pred}}^{[i]}$ is now evaluated at the cell $c_{\text{pred}}^{[i]}$ . If required, it can be interpolated with respect to the continuous position $(x_{\text{pred}}^{[i]},y_{\text{pred}}^{[i]})$ , which is omitted here for the sake of simplicity. In a final step, the sampled position and the corresponding heading angle are concatenated yielding a sampled vehicle pose $\mathbf{x}_{\text{pred}}^{[i]}$ .

Note that throughout this paper, samples are exclusively drawn from cells with $p_{\text{path}}(c)>$ 0.5$$ in order to only guide the motion planner towards regions with a high probability to be part of the future path. In contrast to that, visualizations of $p_{\text{path}}(c)$ are never thresholded and show the unmodified output of the CNN.

III-D Training and Metrics

The proposed model is trained end-to-end using $64\text{\,}\mathrm{\char 37\relax}$ of the recorded trajectories. The training dataset is further augmented by randomly selecting up to $100$ different start states on a recorded trajectory resulting in $807\,273$ (partly correlated) data points. This augmentation strategy does not only upscale the dataset, but also forces the CNN to adjust its prediction as more information on the vehicle’s environment becomes available.

The parameters of the network are optimized using Adam [27] and the loss function

[TABLE]

where the first term computes the cell-wise cross-entropy loss $L_{\text{CE}}(c)$ of the classification task, the second term the cell-wise mean squared error $L_{\text{MSE}}(c)$ of the regression task, and the last term the L2 regularization loss of the weights $w\in\mathcal{W}$ with a scaling factor $\lambda$ . As the majority of cells in the label grids contain no path information (see Fig. 4), a weighting of the relevant cells is conducted with the functions $f_{\text{CE}}(c)$ and $f_{\text{MSE}}(c)$ given as

[TABLE]

where $\gamma_{\text{CE}}$ and $\gamma_{\text{MSE}}$ are hyperparameters of the model, and $\operatorname{\mathds{1}}_{c}$ is the indicator function that is $1$ if the future path crosses a cell $c$ and [math] otherwise.

In order to evaluate the prediction capability of the model, two metrics are introduced below. The first metric measures the proximity of the $N$ predicted samples $\mathbf{x}_{\text{pred}}^{[i]}\in\mathcal{X}_{\text{pred}}$ to the ground truth trajectory $\mathbf{x}_{t}\in\mathcal{X}$ by computing the average path deviation $D$ according to

[TABLE]

where $d(\bullet)$ describes the distance between a pose on the trajectory and a sample. It is computed by a weighted sum of the Euclidean distance and the angular deviation given as

[TABLE]

where the different units of both terms are taken into account by setting $w_{\text{pos}}$ to $0.35$ and $w_{\theta}$ to $0.65$ throughout this paper.

As the predicted poses are supposed to guide the motion planner through complex scenarios, it is beneficial to have evenly distributed samples along the ground truth trajectory. Therefore, the second metric measures the maximum prediction gap $G$ in the following two steps: (1) project every sample onto the trajectory, and (2) evaluate the maximum arc length that is not covered by any sample. The projection in step 1 is defined as

[TABLE]

where $\mathbf{x}_{\text{ref}}^{[i]}$ is the reference pose of the predicted sample $\mathbf{x}_{\text{pred}}^{[i]}$ on the trajectory. These reference poses are then added to the ordered list $\mathcal{X}_{\text{ref}}$ according to their arc length $s_{\text{ref}}^{[i]}$ . The maximum prediction gap $G\in[0,1]$ is finally obtained by

[TABLE]

where the length of the recorded trajectory is denoted by $s_{T}$ .

III-E Hyperparameter Optimization

This subsection evaluates the influence of the hyperparameters on the CNN’s prediction performance based on the previously derived metrics. The analysis is conducted on a validation dataset containing $16\text{\,}\mathrm{\char 37\relax}$ of the recorded trajectories. Similar to the training dataset, up to five different start states are selected on each trajectory resulting in $10\,730$ (partly correlated) data points.

The different parametrizations of the model are trained on an Nvidia Titan X with a batch size of $20$ using the Python interface of TensorFlow. Training a model to convergence takes about $30\text{\,}\mathrm{h}$ , which corresponds to $90\,000$ iterations. A visualization of the training statistics for one of the conducted optimizations can be found in Fig. 5.

In order to decrease the exponentially growing search effort, only the hyperparameters with the highest expected influence on the CNN’s overall performance are optimized. Therefore, two out of the five hyperparameters, namely the exponential learning rate decay $lrd$ and the L2 regularization scaling factor $\lambda$ , are fixed to $lrd=$ 0.01 $after ${10}^{6}$ iterations and $\lambda=$0.003$ . The remaining hyperparameters are optimized in two steps. First, the learning rate $lr$ is varied while keeping $\gamma_{\text{CE}}$ and $\gamma_{\text{MSE}}$ at $25$ , and second, $lr$ is set to its best value while $\gamma_{\text{CE}}$ and $\gamma_{\text{MSE}}$ are optimized.

The quantitative results of the hyperparameter optimization are listed in Tab. I.

It can be seen that lower learning rates result in a better prediction performance as they decrease the maximum prediction gap metric $G$ as well as the average path deviation metric $D$ . The insights from [21] that higher values for $\gamma_{\text{CE}}$ incentivize the CNN to classify more cells as part of the future path can also be observed in Tab. I and Fig. 6. Thus, increasing $\gamma_{\text{CE}}$ reduces the maximum prediction gap $G$ while raising the average path deviation $D$ because more samples are placed further away from the ground truth. A potential risk of too large values for $\gamma_{\text{CE}}$ is a deterioration of the heading angle prediction, which cannot be recovered by increasing $\gamma_{\text{MSE}}$ (see Tab. I). For the remaining evaluations, the model with $\gamma_{\text{CE}}=$ 25 $, $\gamma_{\text{MSE}}=$25$ , and $lr=$ {10}^{-5}$$ is selected as this combination yields a smooth distribution of the samples in close vicinity of the ground truth.

III-F Evaluation

The following paragraphs analyze the prediction capability of the CNN and benchmark its performance against uniform sampling and the A*-based approach called Orientation-Aware Space Exploration (OSE) [3]. To the best of the authors’ knowledge, the latter is currently one of the most effective approaches for guiding a motion planner through complex environments. It is based on an A* graph search and explores the workspace with oriented circles as illustrated in Fig. 7.

Discrete vehicle poses can then be obtained by sampling from three-dimensional Gaussian distributions located at the center of the circles. The standard deviation of the Gaussians can be made dependent of the radius $r$ and is set to $\sigma_{x}=\sigma_{y}=r/3$ for the translational and $\sigma_{\theta}=\pi/6$ for the rotational component, respectively. Additional parameters of the OSE heuristic used in this paper are listed in Tab. II.

A benchmark of the three approaches on the test dataset can be found in Tab. III. The $13\,420$ test cases with up to five different start states on each trajectory represent the remaining $20\text{\,}\mathrm{\char 37\relax}$ of the recordings. For a scenario-specific evaluation, $5770$ trajectories are extracted from this dataset, half of which come from the scenario arena and the other half from parallel and perpendicular parking. Both uniform sampling as well as the OSE procedure are evaluated on a single core of an Intel Xeon E5@ $3.5\text{\,}\mathrm{GHz}$ with the latter being implemented in C++. As opposed to that, the CNN is executed with its Python pipeline on an Nvidia Titan X and an Intel Xeon E5@ $3.1\text{\,}\mathrm{GHz}$ .

In comparison to uniform sampling and the OSE approach, Tab. III shows that the CNN1 predicts the vehicle poses more evenly along the ground truth path and yields overall smaller deviation from it. The mean computation time of the CNN is comparable to the OSE heuristic, however, with a one-third smaller standard deviation. The OSE’s high variance in computation time is due to the fact that the performance of graph search-based algorithms highly depends on the complexity of the environment. This also causes the problem that a solution might not be found before the timeout is reached, which occurred here in $3.4\text{\,}\mathrm{\char 37\relax}$ of all test cases. In summary, the CNN, whose output is qualitatively visualized in Fig. 8, makes its predictions more reliably (no outages) with a lower latency. Both aspects are key features in safety-critical applications such as automated driving.

In order to better understand the effect of the different input grids on the performance of the CNN, an ablation study has been conducted. The results in Tab. III show that removing features from the CNN’s input causes a deterioration of at least one of the analyzed metrics. Furthermore, it can be seen that excluding the goal or the observations causes the greatest decline in performance. In contrast to that, the metrics only change slightly if the CNN is trained and evaluated without the obstacle or the unknown grid (not both). One of the reasons for this is that in many cases, both grids are complementary in the sense that the unknown grid allows to infer the obstacles and vice versa (see Fig. 4).

The scenario-specific benchmark in Tab. III highlights that the CNN’s performance in the parking scenario is almost a factor two better compared to the maze-like structure arena. The resulting insight is that learning stationary sampling distributions in completely unstructured environments is a much harder task for the network than in semi-structured environments. This can also be seen in Fig. 8 on the bottom right, where the CNN prediction degenerates due to the complexity of the scenario. Possible reasons for this are longer prediction horizons, a larger variety of feasible maneuvers, and potentially heavier occlusions. It is left for future work to determine which network structure is the most suitable one for such challenging environments.

A visualization of the CNN prediction in two novel scenarios, namely a construction zone and and a cluttered 4-way intersection, can be found in the lower left images of Fig. 8. Both scenarios were not seen during training and contain novel artifacts such as construction barrels and rectangular dumpsters. While the construction zone only requires the vehicle to follow the lane, the maneuver at the cluttered 4-way intersection consists of three steps: (1) exit the tight parking spot, (2) avoid the dumpster at the side of the road, and (3) make a turn at the intersection. The illustrated predictions showcase the models capability to deal with so far unseen artifacts as well as to generalize to novel scenarios. Only in the second case, the initial path prediction overlaps with the barriers in front of the vehicle. However, this is not a safety issue if the predictions are combined with a motion planner as highlighted in the next section.

IV Guided Motion Planning

This section evaluates the performance gain of the sampling-based motion planner BiRRT* guided by the pose predictions of the CNN. As before, the results are compared with purely uniform sampling and the OSE approach. The benchmark is conducted using Gazebo and ROS in three challenging automated driving scenarios that were excluded from the dataset in the previous section. As it can be seen in Fig. 9, these scenarios differ from the ones used to train the CNN (see Fig. 2) with respect to the spatial arrangement of the objects. Additionally, they contain novel artifacts like traffic cones and previously unseen vehicle geometries.

The parameters of the ego-vehicle are based on a full-size car and can be found in Tab. IV. Both the maximum curvature and the maximum curvature rate already include $10\text{\,}\mathrm{\char 37\relax}$ control reserve for closed-loop execution.

A $60\text{\times}60\text{\,}\mathrm{m}$ occupancy grid with a resolution of $10\text{\,}\mathrm{cm}$ is used to represent the static environment. Collision checks are executed every $10\text{\,}\mathrm{cm}$ using a convex polygon approximation of the vehicle’s contour. The $20$ vertices of that polygon are inflated by $10\text{\,}\mathrm{cm}$ serving as a hard safety buffer. An additional soft safety margin is introduced by converting the occupancy grid into a cost map with a $25\text{\,}\mathrm{cm}$ inflation radius as illustrated in Fig. 9.

In order to compute an initial motion plan, BiRRT* is executed for maximal $10\text{\,}\mathrm{s}$ . Another $3\text{\,}\mathrm{s}$ are assigned for optimization if an initial solution is found. Otherwise, failure is reported for the corresponding run. Note that all computations are executed together with the simulation on an Intel Xeon E5@ $3.5\text{\,}\mathrm{GHz}$ and an Nvidia Quadro K2200, which raises the CNN’s computation time by a factor of two compared to the previous benchmark (see Tab. III and Tab. V).

Uniform sampling is conducted in the $60\text{\times}60\text{\,}\mathrm{m}$ region of the occupancy grid with a goal sampling frequency of $5\text{\,}\mathrm{\char 37\relax}$ . For guided motion planning, heuristic samples are generated in batches of $100$ at a frequency of $4\text{\,}\mathrm{Hz}$ . Each batch of samples is then evenly mixed with uniform samples such that the theoretical guarantees of BiRRT* are not violated.

The sampled vehicle poses are connected using the continuous curvature steering functions CC00-Dubins [29] in Scenario II and HC00-Reeds-Shepp [30] in Scenario I/III, where the superscript denotes the initial and final curvature of the steering procedure. The major difference between the two steering functions is that the first one constrains the vehicle to driving forwards only, while the second one allows moving forwards and backwards.

A summary of the experimental results is shown in Tab. V, where each setup is repeated $100$ times to compensate for randomization effects of the planner. It can be observed that BiRRT* in combination with the CNN-based predictions clearly outperforms the two other approaches. The CNN guides the motion planner to the best initial solution and helps to converge to the lowest cost. This is also highlighted in Fig. 10, which illustrates the convergence rate of the different approaches in the tested scenarios. In addition to that, guiding BiRRT* with the CNN is the only approach with a success rate of $100\text{\,}\mathrm{\char 37\relax}$ in all three scenarios. Compared to uniform sampling, the samples from the CNN not only stabilize the time-to-first solution (TTFS), but also reduce its mean by up to an order of magnitude. This value is currently limited by the network’s inference time and can even be further reduced by a more advanced implementation of the prediction pipeline.

Guiding the motion planner with the OSE heuristic yields a low computation time in Scenario II and III, where the circle expansion gives a good approximation of the planned path. As opposed to that, the performance deteriorates significantly in Scenario I, where no solution can be found in $11$ out of $100$ runs. This is due to the fact that the OSE approach assumes a circular holonomic robot and only considers the nonholonomic constraints using cost terms [3]. As a result, the OSE proposes an immediate turnaround maneuver in Scenario I, where an evasive multi-point turn would be required to resolve this situation (see Fig. 9).

V Conclusion

The complexity of motion planning combined with the real-time constraints of automated vehicles require heuristics that guarantee a fast convergence to a cost-minimizing solution. Within this context, a convolutional neural network (CNN) is proposed that predicts ego-vehicle poses from a start to a goal state while taking into account the current surrounding. A benchmark on a recorded dataset highlights the CNN’s capability to predict path corridors with a higher accuracy compared to two baselines: uniform sampling and a state-of-the-art A*-based approach. Furthermore, it is demonstrated that the CNN has the capability to adapt its prediction to novel scenarios with previously unseen artifacts. Together with the sampling-based motion planner BiRRT*, this results in a significant reduction of computation time to about $100\text{\,}\mathrm{ms}$ , high-quality paths, and success rates of $100\text{\,}\mathrm{\char 37\relax}$ in three challenging automated driving scenarios. In conclusion, the proposed method is especially suitable for real-time motion planning in complex environments.

VI Acknowledgement

The authors thank Florian Faion for valuable discussions and insights.

Bibliography30

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] H. Banzhaf et al. , “Learning to Predict Ego-Vehicle Poses for Sampling-Based Nonholonomic Motion Planning,” IEEE Robotics and Automation Letters , 2019, DOI 10.1109/LRA.2019.2893975.
2[2] D. Dolgov et al. , “Path Planning for Autonomous Vehicles in Unknown Semi-structured Environments,” The International Journal of Robotics Research , 2010.
3[3] C. Chen et al. , “Path Planning with Orientation-Aware Space Exploration Guided Heuristic Search for Autonomous Parking and Maneuvering,” in IEEE Intelligent Vehicles Symposium , 2015.
4[4] D. González et al. , “A Review of Motion Planning Techniques for Automated Vehicles,” IEEE Transactions on Intelligent Transportation Systems , 2016.
5[5] M. Jordan and A. Perez, “Optimal Bidirectional Rapidly-Exploring Random Trees,” MIT, Tech. Rep., 2013, TR 2013-021.
6[6] M. Likhachev and D. Ferguson, “Planning Long Dynamically-Feasible Maneuvers for Autonomous Vehicles,” The International Journal of Robotics Research , 2009.
7[7] D. A. Pomerleau, “ALVINN: An Autonomous Land Vehicle In a Neural Network,” in Advances in Neural Information Processing Systems , 1989.
8[8] M. Bojarski et al. , “End to End Learning for Self-Driving Cars,” ar Xiv preprint ar Xiv:1604.07316 , 2016.