Attention-based Lane Change Prediction
Oliver Scheel, Naveen Shankar Nagaraja, Loren Schwarz, Nassir Navab, Federico Tombari

TL;DR
This paper introduces an attention-based recurrent model for lane change prediction that emphasizes both accuracy and interpretability, incorporating new metrics to assess driver discomfort, with promising results on multiple datasets.
Contribution
The paper presents a novel attention-based recurrent model that improves interpretability and prediction accuracy for lane change prediction tasks.
Findings
Encouraging results on a publicly available dataset
Effective modeling of corner and failure cases
Introduction of metrics reflecting driver discomfort
Abstract
Lane change prediction of surrounding vehicles is a key building block of path planning. The focus has been on increasing the accuracy of prediction by posing it purely as a function estimation problem at the cost of model understandability. However, the efficacy of any lane change prediction model can be improved when both corner and failure cases are humanly understandable. We propose an attention-based recurrent model to tackle both understandability and prediction quality. We also propose metrics which reflect the discomfort felt by the driver. We show encouraging results on a publicly available dataset and proprietary fleet data.
| Algorithm | Frame-based accuracy (L) | Frame-based accuracy (F) | Frame-based accuracy (R) | B4C F1 | B4C TTM | Proposed L: Miss | Proposed L: Delay | Proposed L: Overlap | Proposed F: Frequency | Proposed R: Miss | Proposed R: Delay | Proposed R: Overlap | Total Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NB | 0.715 | 0.886 | 0.679 | 0.0 | 0.0 | 0.002 | 0.269 | 0.64 | 7.271 | 0.003 | 0.295 | 0.595 | 5 |
| RF | 0.744 | 0.938 | 0.67 | 0.767 | 0.965 | 0.003 | 0.231 | 0.636 | 7.231 | 0.006 | 0.269 | 0.513 | 4 |
| SRNN | 0.555 | 0.905 | 0.475 | 0.475 | 1.337 | 0.212 | 0.22 | 0.521 | 6.686 | 0.121 | 0.355 | 0.4 | 6 |
| LSTM | 0.772 | 0.958 | 0.634 | 0.822 | 1.086 | 0.002 | 0.215 | 0.671 | 4.749 | 0.007 | 0.341 | 0.525 | 2 |
| LSTM-E | 0.759 | 0.962 | 0.603 | 0.813 | 1.093 | 0.003 | 0.228 | 0.666 | 4.327 | 0.012 | 0.363 | 0.493 | 3 |
| LSTM-A | 0.784 | 0.951 | 0.662 | 0.802 | 1.138 | 0.003 | 0.207 | 0.694 | 5.34 | 0.01 | 0.306 | 0.547 | 1 |
Attention-based Lane Change Prediction
Oliver Scheel1,2,∗, Naveen Shankar Nagaraja1,∗, Loren Schwarz1, Nassir Navab2, Federico Tombari2
∗ indicates equal contribution.
1 Oliver Scheel, Naveen Shankar Nagaraja and Loren Schwarz are with BMW Group, 80788 München, Germany, {oliver.scheel, naveen-shankar.nagaraja, loren.schwarz}@bmw.de
2 Federico Tombari and Nassir Navab are with the Faculty of Computer Science, Technische Universität München, 85748 Garching bei München, Germany, {tombari, navab}@in.tum.de
I INTRODUCTION
Artificial intelligence is commonly seen as the key enabler for fully autonomous driving. Sensing and Mapping, Perception, and (Path) Planning are often seen as the building blocks of any non-end-to-end autonomous system. The rise of deep learning has led to unprecedented progress in Mapping and Perception. However, path planning has a hybrid nature: it tends to be model-driven with some sub-components learned using deep learning. This is primarily due to the severely complex interaction of different agents (static and dynamic) and prior knowledge (map and traffic information). A dearth of data covering the various corner cases further limits fully data-driven planning.
Prediction is a crucial part of autonomous driving, serving as a ‘lego block’ for tasks like Path Planning, Adaptive Cruise Control, Side Collision Warning, etc. In this work, we address the problem of predicting lane changes of vehicles. This is of paramount importance, as around 18% of all accidents happen during lane change maneuvers [1], and lane changes are often executed in high-velocity situations, e.g. on highways. A precise prediction thus decreases risk and enables safer driving. This safety gain stemming from a sensitive prediction is one side of the coin. On the other hand, though, false predictions have to be avoided as they have a negative influence on driver comfort. Each false prediction results in unnecessary braking or acceleration.
For predicting lane changes, several “classical” models, like Support Vector Machines (SVMs) [2] or Random Forests [3], have been proposed, with only one recurrent neural net having been published recently [4]. These classical methods, though theoretically sound, see maneuver prediction as function estimation. Though the weights on different features can give us a hint as to what the function considers important, these models become hard to analyze once prior knowledge is also given as input. The question we ponder over is: does, or can, a system see what a human looks at? For example, when one approaches a highway entry ramp, the probability of a lane change for vehicles on the ramp is higher, and the human driver slows down with this prior knowledge (see Fig. 1).
To answer the above intriguing question, we:
(a) propose the first recurrent neural network making use of an attention mechanism over different features and time steps. This model is designed to understand complex situations and also explain its decisions. Like humans it can shift its focus towards certain important aspects of the current scene.
(b) introduce metrics which indirectly reflect driver’s comfort, and thus allow a meaningful quantification of prediction quality.
(c) provide the first comprehensive evaluation of several models aimed at the same task on the same benchmark, and analyze critical corner cases and visually interpret them.
We use the publicly available NGSIM [5] dataset as well as proprietary fleet data (Fig. 2) to demonstrate encouraging results w.r.t. state-of-the-art methods.
II RELATED WORK
Lane change prediction, being a fundamental building block for any autonomous driving task, is a hot topic in research and has been investigated for several years [6, 7, 8, 9, 10]. Picking the most informative features according to a criterion and then using “classical” methods, like SVMs or Random Forests [2, 3, 11, 12] contributed to the core of research in lane change prediction.
Schlechtriemen et al. [13] analyzed the expressive power of a multitude of features and came to the conclusion that lateral distance to the lane’s centerline, lateral velocity, and relative velocity to the preceding car are the most discriminative features. They introduced two models, a Naive Bayesian approach, and a Hidden Markov Model on top of the Naive Bayesian model, with the vanilla Naive Bayesian approach performing better. In another work Schlechtriemen et al. [3] tackled the problem of predicting trajectories, where they consider lane change prediction as a helping subtask. To achieve better generalization, they fed all the available features to a random forest.
Woo et al. [2] proposed a hand-crafted energy field to model the surroundings of a car for prediction with a custom SVM model. Weidl et al. [14] introduced Dynamic Bayesian Networks for maneuver prediction with input features from different sensors and safety distances to the surrounding vehicles.
A main drawback of the above approaches is the improper handling of the temporal aspect of features. A simple concatenation of features across time loses expressibility in the temporal domain, mainly due to a high degree of correlation in the features. Patel et al. [4] introduced a Structural Recurrent Neural Network for this problem. Three Long Short-Term Memory (LSTM) cells handle the driving and neighbouring lanes, with inputs being the features of the surrounding vehicles in the corresponding lanes as well as features of the target.
Zeisler et al. [15] followed a different scheme by using raw video data instead of high-level features. Lane changes are predicted using optical flow of observed vehicles. General intention prediction is a close relative of maneuver prediction. Jain et al. [16] demonstrated impressive results on predicting driver intentions. The key contribution was to fuse two LSTM cells handling complementary feature spaces.
Attention mechanisms were first introduced in vision and translation tasks with outstanding performance [17, 18, 19]. The key idea is to guide the model towards certain points of the input, such as important image regions for visual tasks, and particularly relevant words in translation. We integrate a temporal attention mechanism into our model which cherry-picks relevant features across a sequence.
III PROBLEM DEFINITION
Our goal is to predict lane change maneuvers of cars surrounding the ego car. Let $F_t$ be a snapshot of the scene at timestep $t$ containing $N$ vehicles. A prediction algorithm assigns a maneuver label from {left: $L$, follow: $F$, right: $R$} to each of the $N$ vehicles present in $F_t$. Predicting $L$ or $R$ expresses the algorithm’s belief that a vehicle has started a lane change maneuver to the respective side. Predicting $F$, conversely, implies that a vehicle keeps its current lane.
To obtain a prediction, we use the following features for each of the $N$ cars (considered as target vehicle) in $F_t$:
- Target vehicle features: $G^Z = (m, v_{lat}, v_{long}, a_{lat}, h)$, where $m$ is the target’s lateral distance to its lane’s center line, $v_{lat}$ the lateral velocity, $v_{long}$ the longitudinal velocity, $a_{lat}$ the lateral acceleration, and $h$ the heading angle. These features are computed in Frenet coordinates, i.e., the coordinate axis is along the target object’s lane center line.
- Dynamic environment features, i.e., features of cars surrounding the target: $G^E = (dt_X)$ for each surrounding car $X$, in accordance with the definition of Nie et al. [20] (see Fig. 3). Here $dt_X$ denotes the temporal distance between the target and car $X$, i.e., the distance divided by the velocity of the trailing car.
- Static environment features: static features describe the environment type, e.g. map-based features. In the NGSIM dataset an on-/off-ramp is present, which is integrated as $G^M = (d_{on}, d_{off}, lane)$. $d_{on}$ and $d_{off}$ denote the distance to the nearest on-ramp and off-ramp respectively; $lane$ is the one-hot encoding of the lane identifier.
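As an illustration of the feature groups listed above, the following sketch shows one way they could be represented in code. It is not taken from the paper: the class, field, and function names, the SI units, and the handling of a stationary trailing car are all assumptions made for this example. Only the temporal distance (gap divided by the velocity of the trailing car) and the one-hot lane encoding follow the definitions in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TargetFeatures:
    """Target vehicle features in Frenet coordinates (hypothetical field names)."""
    lateral_offset: float         # distance to the lane's center line [m]
    lateral_velocity: float       # [m/s]
    longitudinal_velocity: float  # [m/s]
    lateral_acceleration: float   # [m/s^2]
    heading_angle: float          # [rad]


def temporal_distance(gap_m: float, trailing_velocity_mps: float) -> float:
    """Gap between two cars divided by the velocity of the trailing car."""
    if trailing_velocity_mps <= 0.0:
        # Assumption: a stationary trailing car never closes the gap.
        return float("inf")
    return gap_m / trailing_velocity_mps


def one_hot_lane(lane_index: int, num_lanes: int) -> List[float]:
    """One-hot encoding of a zero-based lane index."""
    return [1.0 if i == lane_index else 0.0 for i in range(num_lanes)]


target = TargetFeatures(0.4, 0.3, 25.0, 0.1, 0.02)
print(temporal_distance(30.0, 15.0))  # 2.0
print(one_hot_lane(1, 3))
```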
IV MODEL
We propose two kinds of recurrent networks for maneuver prediction: (a) a network consisting of multiple LSTM cells, and (b) an attention layer on top of that network. We train the models in a sequence-to-sequence fashion, i.e., at every timestep an output is generated. The input features used for our proposed approaches (target vehicle, dynamic environment, and static environment features) are described in Section III.
IV-A Long Short-Term Memory Network
Our basic LSTM [21] network is inspired by the work of Jain et al. [16]. We use three different LSTM cells, one for each feature group (target vehicle, dynamic environment, and static environment features). This decoupling into separate LSTMs ensures that the intra-group correlation is high but the inter-group correlation is low. We use the following shorthand notation for an LSTM cell:
$$(h_t, c_t) = \mathrm{LSTM}(X_t, h_{t-1}, c_{t-1})$$
where $X_t$ is the input, $h_t$ denotes the hidden state, and $c_t$ the memory unit.
The full network can be seen in Fig. 4. Mathematically, the fusion of these LSTMs can be formulated as:
[TABLE]
where ’s are the weight matrices, ’s are bias vectors, is the fusion layer, and is the output layer.
IV-B Attention Network
The idea behind an attention mechanism is to model selective focus, i.e., on certain parts of the input. It mainly consists of a function processing a key (K) and query (Q) to obtain a context vector, which is the accumulation of multiple keys, weighted by their importance w.r.t. the query. We employ two kinds of attention mechanisms, (a) attention over previous time steps, i.e., self-attention [22], and (b) attention over different feature groups. As opposed to traditional attention approaches, the features we use lie in different spaces and have different modalities. We do not accumulate them, but only change their magnitude in accordance with the weighting, and then accumulate these feature vectors over the time steps; see Fig. 5 for an intuitive visualization.
We again partition the features into categories, but with a finer granularity than in Section III, viz. five categories covering the target, the same lane, the left lane, the right lane, and the street. The attention function is given by:
[TABLE]
For time step in all calls of , layer serves as query. Let be the time steps used for self-attention; we have used in our experiments. For each the feature categories are embedded into a higher-dimensional space, and the importances of each feature category, , as well as each time step as a whole, , are determined. Let :
[TABLE]
where , . Eventually, the feature categories are scaled by and the weighted sum is calculated over all time steps. The resulting context vector is appended to the fusion layer and the computation follows Eq. 1.
[TABLE]
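The scale-then-accumulate step described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the raw importance scores are taken as given inputs instead of being computed from the query, normalizing them with a softmax is an assumption, and all function and parameter names are hypothetical. What the sketch preserves is the structure in the text: feature categories are only rescaled (not summed with each other), and the rescaled vectors are then accumulated over the time steps.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(embeddings: List[List[List[float]]],
                   category_scores: List[List[float]],
                   step_scores: List[float]) -> List[float]:
    """embeddings[t][c] is the embedded vector of category c at time step t.

    category_scores[t][c] and step_scores[t] are raw importance scores
    (assumed to come from an attention function that is not shown here).
    """
    step_weights = softmax(step_scores)
    context: List[float] = []
    for t, categories in enumerate(embeddings):
        category_weights = softmax(category_scores[t])
        # Rescale every category by its weight and concatenate the results;
        # categories are not added to each other.
        scaled = [w * x
                  for w, vector in zip(category_weights, categories)
                  for x in vector]
        if not context:
            context = [0.0] * len(scaled)
        # Weighted sum over the time steps.
        for i, value in enumerate(scaled):
            context[i] += step_weights[t] * value
    return context


# Two time steps, two categories (sizes 2 and 1), uniform scores.
steps = [[[1.0, 2.0], [3.0]], [[1.0, 2.0], [3.0]]]
print(context_vector(steps, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]))
```

Keeping the categories concatenated means the context vector has one slot per embedded feature, which is what allows a per-category contribution to be read off later.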
IV-B1 Visualization of Attention
Apart from improved performance, another large benefit of attention is its interpretability. Traditionally, simply the magnitude of the attention weights, which are used in the calculation of the weighted mean, is shown [18]. Here though, due to the different scales and dimensions of the feature categories, this does not necessarily lead to expected results. Instead, we calculate the derivative of the predicted class by the attention weights and , summing over all time steps. This derivative denotes the contribution of category to the resulting prediction, even providing the information whether this contribution is positive or negative.
IV-C Training Scheme
As proposed in [16], we employ an exponentially growing loss to encourage early predictions. The Softmax loss is weighted with $\alpha \cdot w_t \cdot \exp(-T)$, where at time $t$ a lane change is imminent in the next $T$ seconds. (Exponential weighting of the loss function is not done for the fleet data, as the human labels are error-free.) We choose $\alpha$ s.t. the average value of $\alpha \cdot \exp(-T)$ over all frames of each lane change maneuver equals $1$. For a given maneuver at time $t$, $w_t$ is inversely proportional to that maneuver’s global size in training data.
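The weighting scheme can be made concrete with a small worked example. The sketch below assumes a per-frame weight of the form α · exp(−T), with T the time in seconds until the lane change, and a normalization target of 1 for the average weight within one maneuver; both the exact form and the target value are assumptions here, as are the function names and the proportionality constant used for the class weights.

```python
import math
from typing import Dict, List


def exponential_frame_weights(seconds_to_change: List[float],
                              target_mean: float = 1.0) -> List[float]:
    """Weights alpha * exp(-T) for the frames of one maneuver.

    alpha is chosen so that the average weight over the maneuver's frames
    equals target_mean (assumed to be 1 in this sketch).
    """
    raw = [math.exp(-T) for T in seconds_to_change]
    alpha = target_mean * len(raw) / sum(raw)
    return [alpha * r for r in raw]


def inverse_frequency_weights(frame_counts: Dict[str, int]) -> Dict[str, float]:
    """Class weights inversely proportional to each maneuver's share of the data."""
    total = sum(frame_counts.values())
    return {label: total / count for label, count in frame_counts.items()}


# Frames 3 s, 2 s, 1 s and 0 s before the lane change: weights grow towards it.
weights = exponential_frame_weights([3.0, 2.0, 1.0, 0.0])
print(weights)
print(inverse_frequency_weights({"L": 10, "F": 80, "R": 10}))
```

Frames close to the lane change receive the largest weights, so late mistakes are penalized most, while the normalization keeps the overall loss scale comparable across maneuvers of different length.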
As noted by Schlechtriemen et al. [13], simple scenarios cover a majority of lane changes, and a relatively good prediction can already be achieved by using a small subset of features from . To tackle this imbalance and induce a meaningful gradient flow for the attention in all cases, we introduce a dropout layer in between layer and , i.e.
[TABLE]
With a probability , and are set to 0 independently, forcing the model to rely solely on its recurrent architecture or attention.
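A minimal sketch of this branch-level dropout follows. It assumes that the two quantities being zeroed are the output of the fusion layer and the attention context vector, and that the mechanism is applied during training only; the function name, argument names, and the probability passed in are hypothetical.

```python
import random
from typing import List, Tuple


def drop_branches(fusion: List[float], context: List[float], p: float,
                  rng: random.Random) -> Tuple[List[float], List[float]]:
    """Zero each branch independently with probability p (training only)."""
    if rng.random() < p:
        fusion = [0.0] * len(fusion)    # model must rely on attention
    if rng.random() < p:
        context = [0.0] * len(context)  # model must rely on the recurrent path
    return fusion, context


rng = random.Random(0)
print(drop_branches([0.2, -0.1], [0.7], p=0.5, rng=rng))
```

Because the two draws are independent, both branches are occasionally dropped in the same step; whether that case should be excluded is a design choice the text does not settle.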
V Datasets and Evaluation
V-A Datasets
NGSIM: The Next Generation Simulation (NGSIM) [5] project consists of four publicly available traffic data sets. We use the US Highway 101 dataset (US-101) and Interstate 80 Freeway dataset (I-80). Data is captured from a bird’s-eye view of the highway with a static camera, and vehicle-related high-level features are extracted from it. The datasets contain measurements at . After removing noisy trajectories, lane changes are observed.
Fleet data: The fleet data comes from the fused perception model of in-production cars. This data is captured at w.r.t. to a moving ego car equipped with several camera and radar sensors to give a complete view. 830 lane changes are recorded.
V-B Metrics
A wide variety of metrics is used to measure the performance of lane change prediction algorithms. Predominantly they are inspired by information retrieval and are computed treating each timestep independently of the other.
- Accuracy: percentage of timesteps correctly classified.
Jain et al. [16] introduced a first version of the following maneuver-based metrics:
- Precision: percentage of true predictions w.r.t. total number of maneuver predictions.
- Recall: percentage of true predictions w.r.t. total number of maneuvers.
- Time to Maneuver (TTM): the interval between the time of prediction and the actual start of the maneuver in the ground truth.
For evaluation, we combine Precision and Recall into the F1 score and refer to the metrics introduced by Jain et al. [16] by their associated group’s name, Brain4Cars (B4C).
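As a worked example of the maneuver-based metrics, the sketch below computes Precision, Recall, and their harmonic mean (the standard F1 definition) from three counts. The function and parameter names are hypothetical, and returning 0 when a denominator is zero is a convention chosen for this example.

```python
from typing import Tuple


def precision_recall_f1(true_predictions: int,
                        total_predictions: int,
                        total_maneuvers: int) -> Tuple[float, float, float]:
    """Maneuver-level Precision, Recall and F1 from raw counts."""
    precision = true_predictions / total_predictions if total_predictions else 0.0
    recall = true_predictions / total_maneuvers if total_maneuvers else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 8 correct predictions out of 10 predicted maneuvers, 16 maneuvers in total.
print(precision_recall_f1(8, 10, 16))
```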
The ground truth labels are event-wise continuous (Fig. 6). The information retrieval metrics, however, do not reflect this event-wise nature or what the driver experiences in the car. The car’s controller usually reacts to the first prediction event (Fig. 7). If the prediction is discontinuous then this causes discomfort to the driver (stop-and-go function). In addition, the prediction event should be as early as possible w.r.t. the ground truth, and the earlier the prediction the higher is the comfort. In order to reflect such comfort-related behaviour we propose the following event-wise metrics:
- Delay: delay (measured in seconds) in prediction w.r.t. the ground truth label. If the prediction is perfectly aligned with the ground truth, the delay is 0.
- Overlap: for a given ground truth event, the percentage of overlap for the earliest maneuver predicted. The higher the overlap, the smoother the controller’s reaction.
- Frequency: number of times a maneuver event is predicted per ground truth event. For the ‘follow’ event this indicates the false positive rate (FPR).
- Miss: number of lane changes completely missed. The higher the number of misses, the higher the discomfort, as the driver has to intervene.
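The event-wise metrics above can be sketched on per-frame label sequences. This is one possible reading of the definitions, not the authors' evaluation code: an event is taken to be a contiguous run of identical labels, predictions are only examined inside the ground truth event (so predictions that start early earn no credit), and the function names, the dictionary keys, and the frame interval `dt` are assumptions.

```python
from typing import Dict, List, Tuple


def runs(labels: List[str], target: str) -> List[Tuple[int, int]]:
    """Return [start, end) index pairs of contiguous runs equal to target."""
    found, start = [], None
    for i, label in enumerate(labels):
        if label == target and start is None:
            start = i
        elif label != target and start is not None:
            found.append((start, i))
            start = None
    if start is not None:
        found.append((start, len(labels)))
    return found


def event_metrics(truth: List[str], pred: List[str], target: str,
                  dt: float) -> List[Dict[str, object]]:
    """Delay, Overlap, Frequency and Miss for every ground truth event."""
    results = []
    for begin, end in runs(truth, target):
        predicted = runs(pred[begin:end], target)
        if not predicted:
            results.append({"miss": True, "delay": None,
                            "overlap": 0.0, "frequency": 0})
            continue
        first_start, first_end = predicted[0]
        results.append({
            "miss": False,
            "delay": first_start * dt,  # seconds after the event starts
            "overlap": (first_end - first_start) / (end - begin),
            "frequency": len(predicted),  # predicted events per true event
        })
    return results


truth = list("FFLLLLFF")
pred = list("FFFLFLFF")
print(event_metrics(truth, pred, "L", dt=0.1))
```

In this example the prediction flickers, so the single ground truth event yields a frequency of 2 and an overlap of only one quarter, which is exactly the stop-and-go behaviour the frame-based accuracy would hide.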
V-C Labeling
The perception of the precise moment when a lane change starts differs from person to person, see Fig. 8. Therefore, manually labeling lane changes in fleet data gives us a hint at the intention. However, automatic labeling is useful in the case of NGSIM due to a similar time span of lane changes. Thus, like [13] we have used a -second criterion, before the target’s lane assignment changes, to label a lane change. Though human labeling is precise and error-free, it is time-consuming and expensive. Intelligent automatic labeling can be slightly imprecise, but on the other hand, is quicker and might prove to be better for deep models, which could pick up on fine cues imperceptible to humans to achieve a better performance.
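The automatic labeling rule can be illustrated with a short sketch: every frame within a fixed horizon before the target's lane assignment changes is labeled as a lane change, and all other frames as follow. The horizon is left as a parameter (`horizon_s`) because no particular value is assumed here; the function name, the sampling rate argument, and the convention that a smaller lane identifier lies further to the left are likewise assumptions.

```python
from typing import List


def label_lane_changes(lane_ids: List[int], horizon_s: float,
                       hz: float) -> List[str]:
    """Label the frames in the horizon before each lane-id change as L or R."""
    labels = ["F"] * len(lane_ids)
    horizon_frames = int(round(horizon_s * hz))
    for i in range(1, len(lane_ids)):
        if lane_ids[i] != lane_ids[i - 1]:
            # Assumption: lane ids increase from left to right.
            direction = "L" if lane_ids[i] < lane_ids[i - 1] else "R"
            for j in range(max(0, i - horizon_frames), i):
                labels[j] = direction
    return labels


# Lane assignment changes from lane 2 to lane 1 at frame 4.
print(label_lane_changes([2, 2, 2, 2, 1, 1], horizon_s=0.2, hz=10.0))
```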
VI RESULTS
We denote our two proposed recurrent methods in Section IV as LSTM-E (extended LSTM) and LSTM-A (extended LSTM with attention). For both a hidden size of 128 is used. We implement state-of-the-art baselines to demonstrate better performance of our proposed methods.
VI-A Baseline-Methods
Frame-based: Features from a single timestep are used.
- Random Forest (RF) [3]: The concatenated target vehicle, dynamic environment, and static environment features serve as input.
- Naive Bayes (NB) [13]: The target’s lateral distance to its lane’s center line, its lateral velocity, and the relative velocity to the preceding car are used.
Sequence-based:
- Structural RNN (SRNN) [4]: The SRNN consists of three different LSTM cells which cover the target, left, and right lane respectively. To each LSTM cell the features of three vehicles are given, viz. those of the two neighbors of the target car in the respective lane (e.g., PV and RV) and the target car itself. The input features consist of absolute world coordinates, lateral and longitudinal velocity, heading angle, and number of lanes to the left and right. The output of the three LSTM cells is passed on to another LSTM cell, which eventually outputs the prediction.
- Vanilla LSTM (LSTM): A vanilla LSTM consisting of a single cell that receives the concatenated features of all three groups.
VI-B Quantitative Results
Table I shows the results of all tested methods w.r.t. all metrics on the NGSIM and fleet datasets. As can be seen, due to the diversity of the evaluation metrics, some methods excel or fail in different categories. Sequence-based methods easily outperform frame-based methods since the latter carry no information regarding the sequence history. Among sequence-based methods, our three recurrent models, LSTM, LSTM-E, and LSTM-A, come out on top (refer to the ‘Total Rank’ column in the table).
On the NGSIM dataset, the LSTM network with attention is the best-performing method. It has the lowest delay while predicting lane changes, a lower false positive rate during ‘follow’, and a good continuous prediction indicated by ‘Overlap’. On our fleet data LSTM-A finishes second. This is mainly due to the sparsity of the dynamic environment features in fleet data. Thus, the prediction falls back to the target vehicle features, as these are the most discriminative features, and the performance is similar to that of the vanilla LSTM.
VI-C Qualitative Results
As can be seen from Table I, the performance of some methods is relatively similar. Analyzing and interpreting a few critical corner cases will help in assessing the performance and give us clarity about the advantage of an attention mechanism. These critical corner cases are not present in the data. They were created by translating around the existing trajectories w.r.t. their position in the scene, and thus remain realistic. For the fleet data we do not have the static data recording, but instead a moving ego car (drawn in red) from which the measurements of the scene are obtained.
Two types of visualizations are used: (a) a snapshot visualization of a single frame, and (b) a visualization of the temporal development of a scene. The first consists of a single image, showing the ground truth and prediction of a single algorithm for that frame, as well as the attention visualization for the five feature categories. For better readability, the five categories are denoted by Target, Same, Left, Right, and Street. The second is the concatenation of several frames spanning a certain amount of time, along with the prediction of different algorithms.
Fig. 9 and Fig. 10 show the influence of attention on the network’s decision making, highlighting its correct and intuitive contribution.
Fig. 11 shows the temporal development of two scenes while plotting the output of three algorithms - RF, LSTM-E, and LSTM-A. Overall a superior performance of the recurrent models, especially LSTM-A, can be observed.
VII CONCLUSIONS
We have proposed an LSTM network with an attention mechanism for lane change prediction, which performs better than existing methods w.r.t. different evaluation schemes. This is the first work applying such a model to this field, and it tackles both prediction quality and understandability. We have also proposed new event-wise metrics catering to the driver’s comfort. Results on a public dataset as well as fleet data indicate a high level of comfort for the driver with our proposed methods, in terms of earliness of prediction, false positive rate, and miss rate. Moreover, with visual analysis of critical cases we have demonstrated the effectiveness of using attention. In the future, analyzing fleet data with complex scenes using our attention mechanism can shed light on circumventing critical cases for fully autonomous driving. Such understandable mechanisms are helpful in diagnosing and minimizing accidents. This can eventually lead to improved path planning algorithms.
REFERENCES
- [1] National Highway Traffic Safety Administration, “FARS encyclopaedia - vehicles involved in single- and two-vehicle fatal crashes by vehicle manoeuvre,” 2009.
- [2] H. Woo, Y. Ji, H. Kono, Y. Tamura, Y. Kuroda, T. Sugano, Y. Yamamoto, A. Yamashita, and H. Asama, “Dynamic potential-model-based feature for lane change prediction,” in Int. Conf. on Systems, Man, and Cybernetics (SMC), 2016.
- [3] J. Schlechtriemen, F. Wirthmueller, A. Wedel, G. Breuel, and K. Kuhnert, “When will it change the lane? A probabilistic regression approach for rarely occurring events,” in Intelligent Vehicles Symposium (IV), 2015.
- [4] S. Patel, B. Griffin, K. Kusano, and J. J. Corso, “Predicting future lane changes of other highway vehicles using RNN-based deep models,” Int. Conf. on Intelligent Robots and Systems (IROS), 2018.
- [5] “NGSIM project,” https://ops.fhwa.dot.gov/trafficanalysistools/ngsim.htm.
- [6] W. Yao, H. Zhao, F. Davoine, and H. Zha, “Learning lane change trajectories from on-road driving data,” in Intelligent Vehicles Symposium (IV), 2012.
- [7] W. Yao, H. Zhao, P. Bonnifait, and H. Zha, “Lane change trajectory prediction by using recorded human driving data,” in Intelligent Vehicles Symposium (IV), 2013.
- [8] E. Balal, R. L. Cheu, and T. Sarkodie-Gyan, “A binary decision model for discretionary lane changing move based on fuzzy inference system,” Transportation Research Part C: Emerging Technologies, 2016.
