An Attention-based Recurrent Convolutional Network for Vehicle Taillight Recognition
Kuan-Hui Lee, Takaaki Tagawa, Jia-En M. Pan, Adrien Gaidon, Bertrand Douillard

TL;DR
This paper presents an end-to-end deep learning framework combining CNN, LSTM, and attention mechanisms to accurately recognize vehicle taillights, such as turn and brake signals, from image sequences for automated driving applications.
Contribution
It introduces an attention-based recurrent convolutional network that effectively captures spatial and temporal features for vehicle taillight recognition, outperforming existing methods.
Findings
Achieved higher accuracy than state-of-the-art methods on the UC Merced Vehicle Rear Signal Dataset.
Demonstrated the effectiveness of attention mechanisms in focusing on relevant spatial and temporal features.
Validated the approach's potential for improving automated driving systems.
Abstract
Vehicle taillight recognition is an important application for automated driving, especially for intent prediction of ado vehicles and trajectory planning of the ego vehicle. In this work, we propose an end-to-end deep learning framework to recognize taillights, i.e. rear turn and brake signals, from a sequence of images. The proposed method starts with a Convolutional Neural Network (CNN) to extract spatial features, and then applies a Long Short-Term Memory network (LSTM) to learn temporal dependencies. Furthermore, we integrate attention models in both spatial and temporal domains, where the attention models learn to selectively focus on both spatial and temporal features. Our method is able to outperform the state of the art in terms of accuracy on the UC Merced Vehicle Rear Signal Dataset, demonstrating the effectiveness of attention models for vehicle taillight recognition.
Table I: Distribution of the taillight states used in the experiments.
| Class | OOO | BOO | OLO | BLO | OOR | BOR | OLR | BLR | Total |
|---|---|---|---|---|---|---|---|---|---|
| Train samples | 3256 | 2247 | 761 | 903 | 707 | 373 | 149 | 246 | 8462 |
| Test samples | 1526 | 1459 | 520 | 464 | 270 | 384 | 219 | 69 | 4811 |

Table II: Accuracy (%) of each method per taillight state and in total.
| Method | OOO | BOO | OLO | BLO | OOR | BOR | OLR | BLR | Total |
|---|---|---|---|---|---|---|---|---|---|
| C3D [16] | 93.1 | 89.9 | 91.7 | 95.3 | 96.3 | 96.4 | 99.5 | 100 | 93.2 |
| I3D [17] | 96.3 | 93.1 | 94.2 | 85.1 | 97.8 | 95.1 | 98.2 | 100 | 94.2 |
| CNN-LSTM [5] | 97.5 | 90.4 | 95.8 | 88.4 | 95.2 | 87.0 | 89.5 | 92.8 | 93.0 |
| Proposed (S) | 96.8 | 92.9 | 95.0 | 89.2 | 95.9 | 87.8 | 94.5 | 100 | 93.9 |
| Proposed (T) | 98.4 | 95.1 | 90.8 | 89.2 | 97.0 | 92.4 | 90.0 | 100 | 94.8 |
| Proposed (S+T) | 98.6 | 95.3 | 90.8 | 93.3 | 97.0 | 97.0 | 97.7 | 98.6 | 96.1 |
An Attention-based Recurrent Convolutional Network
for Vehicle Taillight Recognition
Kuan-Hui Lee, Takaaki Tagawa, Jia-En M. Pan, Adrien Gaidon, Bertrand Douillard
All the authors are with Toyota Research Institute, CA, USA. {kuan.lee, takaaki.tagawa, marcus.pan, adrien.gaidon, bertrand.douillard}@tri.global
I INTRODUCTION
Since the 2005 DARPA Grand Challenge and the 2007 Urban Challenge, researchers have developed many successful technologies towards automated driving, including key perception tasks [1, 2, 3]. Nonetheless, higher level concepts such as intent prediction, human-machine interaction, and vehicle-to-vehicle communication, are still open questions. In particular, intent prediction of ado vehicles is one of the most critical features for automated driving safety.
In this work, we propose a method to perceive one of the key explicit intent signals the ego vehicle needs to understand from the ado vehicles: the status of turn and brake lights. Numerous methods have been proposed to recognize vehicle taillight signals. Typical computer vision pipelines [4, 5, 6, 7, 8, 9, 10, 11, 12] detect and extract ado vehicles’ bounding boxes in the frames captured by the ego vehicle’s front-facing camera, and then recognize the signal states. This also requires encoding temporal dependencies, since the signal states are determined by changes in the activation of the lights. Hence, vehicle taillight recognition can be treated as an application of video recognition, which enables leveraging progress in related deep learning models, especially in action recognition and image caption generation. There are two main types of methods based on the network model: Recurrent Neural Network (RNN) based methods and 3-D CNN based methods. The RNN based methods [13, 14, 15] usually use a CNN for feature extraction and feed the features to a recurrent network, such as an LSTM. The 3-D CNN based methods [16][17] apply convolution operations not only in the spatial domain but also in the temporal domain. In addition, two-stream fusion architectures use optical flow to fuse spatial and temporal dependencies [18][19]. Such deep learning methods have achieved promising results, inspiring researchers to apply these ideas and techniques to vehicle taillight recognition [4][5].
In this paper, we leverage similar insights and propose a novel end-to-end recurrent convolutional neural network for vehicle taillight recognition. Our model is based on the CNN-LSTM framework paired with a spatio-temporal attention model that emphasizes not only focal regions of the images but also key time steps of the sequences. As a result, the proposed method outperforms the state of the art in accuracy. Furthermore, our model is more interpretable, as our spatio-temporal attention maps allow us to verify whether the inference relied on appropriate causal factors (spatio-temporal regions) for its prediction. In summary, our main contributions are:
- We propose an end-to-end deep neural network for vehicle taillight recognition, where the architecture is mainly based on the CNN-LSTM framework.
- The proposed method integrates a spatial attention model into the framework, so as to emphasize "regions of interest" of the images.
- Our model also contains a temporal attention model to concentrate on certain time steps that are key to the recognition task.
The rest of the paper is organized as follows. Section II gives a brief overview of related works. Section III depicts the proposed method and network architectures. The experimental results are shown in Section IV, followed by the conclusion in Section V.
II RELATED WORK
II-A Vehicle Taillight Recognition
Several taillight recognition methods detect the signal states by thresholding on color features [6][7]. Chen et al. [8] use color thresholds to train an AdaBoost based classifier that detects the presence of turn signals, and then use reflectance contrast to tell their direction. Frohlich et al. [20] localize the turn light regions from the absolute difference between two successive frames, and train an AdaBoost based classifier on the extracted features, which are transformed to the frequency domain. Chen et al. [21] propose a response function to determine the taillight regions; a high-pass mask is then used to find the brighter in-region pixels, from which the states are classified. Several methods [9, 10, 22] utilize tracking to localize taillights, and classify the states based on the luminance channel observed over time in the two detected light regions. Cui et al. [11] detect the turn lights based on thresholds, and use a support vector machine (SVM) to classify signals into four states.
Recently, deep learning approaches have also been applied to learn features for taillights. Zhong et al. [12] train a fully convolutional network (FCN) [23] model to identify the light regions and the features extracted within the regions are classified by a linear SVM. Wang et al. [4] train a CNN model from vehicle rear appearances to tell the state of the brake signals image by image. In order to take temporal dependencies into account, Hsu et al. [5] propose a CNN-LSTM structure to learn eight states of taillights, where the networks for brake and turn signals are trained separately.
II-B Attention-Based Models
Neural attention models focus on certain key parts of the data to improve inference. Soft-selection has recently proven effective in many recurrent applications, such as machine translation [24, 25, 26], image caption generation [27], image recognition [28][29], and human action recognition [30, 31, 32].
In machine translation, attention models are widely applied in the sequence-to-sequence framework proposed in [33]. Bahdanau et al. [24] apply a soft-alignment mechanism to jointly translate and selectively align words. Luong et al. [25] propose local and global attention mechanisms, where the global (resp. local) approach attends to the whole (resp. a subset of the) sequence. Gehring et al. [26] introduce an architecture with CNNs to optimize GPU usage. Their method also equips each decoder layer with a separate attention model.
Attention mechanisms have also proven effective for a wide array of computer vision tasks. In [27], Xu et al. adapt the attention mechanism to the spatial domain such that the proposed network automatically learns to concentrate on salient objects. Ba et al. [28] propose an RNN model trained with reinforcement learning to recognize multiple objects in images, where the network focuses on the most relevant regions of the input image. Mnih et al. [29] train a neural network to learn alignments between image objects and agent actions in dynamic control problems.
Beyond text and images, several works also investigate attention models for human action recognition, achieving significant performance improvements [27, 28, 29, 30, 31, 32]. Yao et al. [34] propose a 3-D CNN-RNN encoder-decoder architecture that captures not only local spatio-temporal information but also global context via an attention mechanism. Sharma et al. [30] apply a spatial attention model to video frames and embed the attention weights into a multi-layer LSTM network for action recognition. In [31], Song et al. propose an end-to-end training framework based on the LSTM model with spatial and temporal attention models for human joints. Yeung et al. [32] take adjacent frames within a sliding window into account and propose a recurrent LSTM-based model, where a temporal attention model is learned to enhance the performance of dense labeling of actions.
III PROPOSED FRAMEWORK
We propose a CNN-LSTM based network with spatial and temporal attention mechanisms for vehicle taillight recognition. An overview of the architecture is shown in Figure 1. Using object detection techniques, a sequence of bounding-box images of one vehicle is extracted from the video frames. From this sequence, chunks of images are sampled by sliding a window along the temporal direction, so that each chunk is a fixed-length run of consecutive images. Inspired by [18][19], which leverage optical flow to represent temporal dependencies, we first calculate the frame difference, using the SIFT flow algorithm [35] to align the vehicles in successive frames:
[Equation (1)]
where the warping function, based on SIFT flow, maps each frame to the next frame. The frame differences are then combined with the images to form the input of the network.
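The input construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `identity_warp` is a hypothetical stand-in for a SIFT-flow-based warp, the sign of the difference (next frame minus warped current frame) is assumed, and the layout "first frame followed by the differences" follows the description of the inputs in Section IV-A.

```python
from typing import Callable, List

Image = List[List[float]]  # one grayscale image as rows of pixel values


def identity_warp(frame: Image) -> Image:
    """Hypothetical stand-in for a SIFT-flow warp: returns a copy of the frame."""
    return [row[:] for row in frame]


def frame_differences(chunk: List[Image],
                      warp: Callable[[Image], Image] = identity_warp) -> List[Image]:
    """Subtract each warped frame from its successor (sign convention assumed)."""
    diffs = []
    for current, following in zip(chunk, chunk[1:]):
        aligned = warp(current)  # align the current frame to the next one
        diffs.append([
            [nxt - cur for nxt, cur in zip(next_row, cur_row)]
            for next_row, cur_row in zip(following, aligned)
        ])
    return diffs


def build_network_input(chunk: List[Image],
                        warp: Callable[[Image], Image] = identity_warp) -> List[Image]:
    """Assumed layout: the first frame followed by the aligned frame differences."""
    return [chunk[0]] + frame_differences(chunk, warp)


# Three 1x2 frames: the lights brighten once, then stay constant.
chunk = [[[1.0, 2.0]], [[3.0, 5.0]], [[3.0, 5.0]]]
print(build_network_input(chunk))
```

With a real alignment step, only the `warp` argument would change; a static frame pair yields an all-zero difference, so the network mostly sees changes in light activation.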
Each input is forwarded through the first part of the CNN to obtain deep features. These deep features, which are the input of the spatial attention model, are forwarded to two 2-D convolutional layers, and the output is concatenated with the last hidden variable from the LSTM. The concatenated tensors are then forwarded to the fully connected (FC) layers, the hyperbolic tangent (tanh) layer, and the softmax layer to obtain the attention weights. The deep features are multiplied element-wise by the attention weights and then forwarded to the remaining part of the CNN to produce latent features. The latent features are then forwarded to an LSTM network that encodes temporal dependencies.
The input and the output of the LSTM network are recurrent over time steps. At each time step, the input consists of the latent feature from the last fully connected layer of the CNN, the current hidden variable, and the current internal state (memory cell), while the output consists of the hidden variable and the memory cell for the next time step. Both the hidden variable and the memory cell are updated and then passed back to the LSTM network at every time step.
The LSTM network outputs a set of hidden variables H and a set of memory cells, which are used in the temporal attention model. The temporal attention model adopts the attention model proposed in [26], which calculates the attention by taking the dot-product between decoder context and encoder representations. Instead of multiple decoder layers, we use only a single layer on the output of the LSTM network. The inputs of the attention model are fused into a set of state summaries D. Then, a matrix product of H and D is computed, and the result is passed through a softmax layer for attention selection. The attention weights are then applied to H. The adjusted hidden variables are passed through the fully connected layers and the tanh layer to obtain the class probability distribution.
III-A Spatial Attention with Region Selection
Visual attention has been shown to be an effective mechanism in image applications, by selectively focusing on certain regions in images [27, 28, 29]. In this work, we propose an attention model for region selection. First, the deep features that form the input of the attention model are forwarded to two 2-D convolutional layers with kernel size 1. The first layer has the same number of input and output channels, while the input and output channel counts of the second layer may differ. Then, the 2-D attention weight at each spatial coordinate for each time step is defined by:
[Equation]
[Equation]
where the weight terms are learnable parameter matrices and the bias terms are bias vectors. The attention weights form a 2-D matrix in which each cell corresponds spatially to one feature vector of the deep features. A softmax selection is adopted to emphasize the corresponding regions in the latent features. After an element-wise product with the attention weights, the weighted deep features are forwarded to the remaining part of the CNN.
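The last two steps of the spatial attention, the softmax selection and the element-wise product, can be illustrated with a short sketch. It assumes the unnormalized scores have already been produced by the convolutional, FC, and tanh layers, and that the softmax is taken over all spatial positions of one time step; the function names are hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]  # one 2-D map: rows of values


def spatial_softmax(scores: Grid) -> Grid:
    """Normalize a 2-D score map so that all cells together sum to 1."""
    peak = max(max(row) for row in scores)  # subtract the max for stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def apply_spatial_attention(features: List[Grid], weights: Grid) -> List[Grid]:
    """Multiply every channel of the deep features by the same 2-D weight map."""
    return [
        [[f * w for f, w in zip(f_row, w_row)]
         for f_row, w_row in zip(channel, weights)]
        for channel in features
    ]


# A 2x2 score map in which the bottom-right cell scores highest.
scores = [[0.0, 0.0], [0.0, math.log(3.0)]]
weights = spatial_softmax(scores)
features = [[[6.0, 6.0], [6.0, 6.0]]]  # one channel of constant features
print(apply_spatial_attention(features, weights))
```

Because the same weight map is shared across channels, the attention only rescales regions and leaves the channel structure of the deep features intact.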
III-B Temporal Attention with Frame Selection
In a sequence, the input at each time contains temporal information with different importance weights for the final classification. For example, the moment when taillights are flashing is more valuable than other frames when the network is trying to recognize the state of a vehicle’s taillights. Hence, we apply an attention model along the temporal direction to emphasize critical moments for vehicle taillight recognition.
Based on the outputs provided by the LSTM network, we integrate the attention model proposed in [26] into the proposed framework. The temporal attention weight for a given state summary and hidden variable is computed as the dot-product between the two, followed by a soft-selection:
[Equation]
The state summary at each time step is defined by:
[Equation]
where the two weight terms are learnable parameter matrices and the remaining term is a bias vector. The attention weight indicates how much each input contributes to the output. Therefore, the output hidden variable is adjusted according to the attention weights:
[Equation]
The adjusted hidden variable is then forwarded to the fully connected layers and the tanh layer to obtain the final prediction for each time step:
[Equation]
where the weight terms are learnable parameter matrices and the bias terms are bias vectors.
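The dot-product attention over time steps can be sketched as below. The sketch takes the hidden variables H and the state summaries D as given lists of vectors, since how D is fused from the LSTM outputs is not reproduced here. Applying the weights to H as a weighted sum is an assumption made for illustration, in the spirit of the dot-product attention of [26].

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(H: List[Vector],
                       D: List[Vector]) -> Tuple[List[Vector], List[List[float]]]:
    """For each state summary, attend over all hidden variables.

    Returns the adjusted hidden variables and the attention weights.
    The weighted sum over H is an assumed way of applying the weights.
    """
    adjusted, weights = [], []
    for d in D:
        # Dot-product score between this state summary and every hidden variable.
        scores = [sum(di * hi for di, hi in zip(d, h)) for h in H]
        a = softmax(scores)
        weights.append(a)
        adjusted.append([
            sum(a_j * h[k] for a_j, h in zip(a, H))
            for k in range(len(H[0]))
        ])
    return adjusted, weights


H = [[1.0, 0.0], [0.0, 1.0]]   # hidden variables for two time steps
D = [[0.0, 0.0], [50.0, 0.0]]  # one neutral and one sharply peaked summary
adjusted, weights = temporal_attention(H, D)
print(weights)
```

A zero state summary gives uniform weights, so the adjusted hidden variable is the mean of H, while a summary strongly aligned with one time step concentrates nearly all weight there.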
III-C Training Procedure
During training, the objective is the cross-entropy loss between the predictions and the labels. To take the temporal dependence of an input sequence into account, we focus on the prediction of the last frame, which contains sufficient information from all the previous frames. In other words, we compute the loss on the last frame and backpropagate it through all frames in the sequence.
Due to the mutual influence of the attention models and the LSTM, the whole network needs to be optimized carefully. First, the CNN and the LSTM network are trained from scratch, which allows the main stream of the network to reach a certain level of convergence. Then, the CNN-LSTM network, along with the temporal attention model, is fine-tuned from the model pre-trained in the first step. Finally, the whole network, i.e., both spatial and temporal attention models with the CNN-LSTM network, is fine-tuned from the model pre-trained in the second step. Such progressive training helps the network converge and yields better results.
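The three-step progressive schedule can be written down as a small data structure, which makes explicit what each step adds and which earlier step it starts from. The stage and module names are hypothetical labels chosen for this sketch; it describes the schedule only and does not train anything.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str                  # hypothetical label for the training step
    modules: Tuple[str, ...]   # modules optimized in this step
    init_from: Optional[str]   # earlier step whose weights are loaded, if any


SCHEDULE = (
    Stage("cnn_lstm", ("cnn", "lstm"), None),
    Stage("temporal", ("cnn", "lstm", "temporal_attention"), "cnn_lstm"),
    Stage("full", ("cnn", "lstm", "temporal_attention", "spatial_attention"),
          "temporal"),
)


def newly_added(schedule: Tuple[Stage, ...]) -> Dict[str, Tuple[str, ...]]:
    """Map each stage name to the modules it introduces for the first time."""
    seen = set()
    added = {}
    for stage in schedule:
        added[stage.name] = tuple(m for m in stage.modules if m not in seen)
        seen.update(stage.modules)
    return added


for name, modules in newly_added(SCHEDULE).items():
    print(name, "adds", modules)
```

Each step introduces at most one new attention model on top of an already converged network, which is the point of the progressive scheme.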
IV EXPERIMENTAL RESULTS
We evaluate the proposed method on the UC Merced Vehicle Rear Signal Dataset [5][36], which contains 649 videos with 63,637 frames in total. The sequences are recorded during the daytime under real-world driving conditions with various vehicle types. There are eight taillight states, covering all combinations of brake and turn lights. As shown in Figure 2, each state is denoted by three letters, one each for the brake ("B"), left ("L"), and right ("R") signals: the letter of a signal appears when that signal is on, and the letter "O" appears when it is off. Table I shows the distribution of the states used in the experiments.
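The three-letter state code can be made concrete with a short sketch that enumerates the eight labels and decodes one back into its signals. The function names are hypothetical; the enumeration order is chosen to match the column order of Tables I and II.

```python
from typing import Dict, List


def taillight_states() -> List[str]:
    """Enumerate the eight labels in the column order of Tables I and II."""
    return [b + l + r for r in "OR" for l in "OL" for b in "OB"]


def decode_state(label: str) -> Dict[str, bool]:
    """Turn a label such as 'BLO' into on/off flags for each signal."""
    if len(label) != 3 or label[0] not in "OB" or label[1] not in "OL" \
            or label[2] not in "OR":
        raise ValueError(f"not a valid taillight state: {label!r}")
    return {
        "brake": label[0] == "B",
        "left": label[1] == "L",
        "right": label[2] == "R",
    }


print(taillight_states())
print(decode_state("BLO"))
```

For example, "BLR" decodes to all three signals on, which corresponds to braking with both turn signals flashing.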
In this work, we adopt ResNet50 [37] as the CNN architecture, initialized with the ImageNet pre-trained model. The LSTM network has one recurrent layer with a hidden size of 256. Each input batch contains 128 chunks, and each chunk contains 16 frames extracted from a sequence with a sliding window. A chunk of 16 images allows us to observe at least one cycle of the signal state. Each chunk is given the ground-truth label of the video sequence it belongs to and is resized before being fed to the network. We augment the raw data with color balance, random contrast, random brightness, and horizontal flipping. During error backpropagation, we compute the bootstrapped cross entropy loss [38]. The SGD optimizer is used to adjust all parameters in our PyTorch implementation.
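The chunk sampling can be sketched as a sliding window over one video's frames, with every chunk inheriting the label of its video. The chunk length of 16 comes from the text; the stride of 1 and the function names are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def sliding_chunks(frames: Sequence[Frame], length: int = 16,
                   stride: int = 1) -> List[Sequence[Frame]]:
    """Cut a frame sequence into overlapping fixed-length chunks.

    The stride is an assumed parameter; sequences shorter than
    `length` yield no chunks.
    """
    last_start = len(frames) - length
    return [frames[i:i + length] for i in range(0, last_start + 1, stride)]


def labeled_chunks(frames: Sequence[Frame], video_label: str,
                   length: int = 16,
                   stride: int = 1) -> List[Tuple[Sequence[Frame], str]]:
    """Attach the video-level label to every chunk cut from that video."""
    return [(chunk, video_label)
            for chunk in sliding_chunks(frames, length, stride)]


frames = list(range(20))  # stand-in for 20 cropped vehicle images
chunks = labeled_chunks(frames, "OOR")
print(len(chunks), chunks[0][0][:3], chunks[-1][0][-1])
```

A 20-frame video yields five overlapping chunks with these defaults; a larger stride reduces the overlap between neighboring chunks.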
IV-A Quantitative Results
To evaluate the performance, we compare the proposed method to the baselines in terms of average accuracy over the predictions of all sequence chunks in each video. As baselines we select two 3-D CNN based methods, C3D [16] and I3D [17], where I3D is built on Inception-v1. We also compare to a CNN+LSTM based method [5]. For a fair comparison, all methods take the first frame and the frame difference obtained by Eq. (1) as input, and the network settings (length of sequence chunk, hidden size, etc.) remain the same.
Table II shows the quantitative results of the comparisons. The proposed (S) method is better than the CNN-LSTM in total accuracy (93.9 vs. 93.0), which indicates that the spatial attention model improves the performance. The proposed (T) method reaches 94.8, above the CNN-LSTM (93.0) and both 3-D CNN based methods (93.2 for C3D and 94.2 for I3D). This suggests that the temporal attention model can encode the temporal dependencies better than 3-D convolution. When equipped with both spatial and temporal attention models, the proposed (S+T) method outperforms all baseline methods with a total accuracy of 96.1. Overall, facilitated by the attention models, the proposed method is able to tell the taillight states effectively.
IV-B Feature Selection in Spatial Attention
In the spatial attention model, the layer from which features are extracted influences the performance, because the layers differ in spatial dimensions. To evaluate this influence, we keep the settings used in Section IV-A but feed the outputs of different layers to the spatial attention model. Figure 3 shows the accuracy for the outputs of different layers, labeled with the corresponding layer names in ResNet50. The results show that the performance improves when deeper features are selected. When the conv_1x or conv_2x layer is selected, the performance is worse than without the spatial attention model. This implies that attending to more abstract (deeper) features is more effective than attending to less abstract features.
IV-C Visualization
To analyze how the spatio-temporal attention model works in the framework, we visualize the attention weights during inference. For spatial attention, we normalize the weights in 2-D and visualize their intensity in green. For the spatial attention model, we select the outputs of the conv5_x layer as deep features, so the attention weights form a 2-D matrix at the spatial resolution of that layer. To properly visualize the attention weights, we resize them and blend them with the input frames.
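The visualization steps (normalize, resize, blend in green) can be sketched in plain Python. Min-max normalization, nearest-neighbor resizing, and alpha blending toward pure green are assumptions chosen to keep the example short; the text does not state which variants were used.

```python
from typing import List, Tuple

Grid = List[List[float]]
Pixel = Tuple[int, int, int]  # (red, green, blue), each in 0..255


def normalize(weights: Grid) -> Grid:
    """Min-max scale a 2-D attention map to the range [0, 1]."""
    flat = [w for row in weights for w in row]
    low, high = min(flat), max(flat)
    if high == low:  # flat map: nothing to highlight
        return [[0.0 for _ in row] for row in weights]
    return [[(w - low) / (high - low) for w in row] for row in weights]


def resize_nearest(grid: Grid, height: int, width: int) -> Grid:
    """Upsample a small map to the frame size with nearest-neighbor lookup."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[y * rows // height][x * cols // width] for x in range(width)]
            for y in range(height)]


def blend_green(frame: List[List[Pixel]], heat: Grid,
                alpha: float = 0.5) -> List[List[Pixel]]:
    """Blend each pixel toward pure green in proportion to its attention."""
    blended = []
    for pixel_row, heat_row in zip(frame, heat):
        out_row = []
        for (r, g, b), h in zip(pixel_row, heat_row):
            a = alpha * h
            out_row.append((int(r * (1 - a) + 0.5),
                            int(g * (1 - a) + 255 * a + 0.5),
                            int(b * (1 - a) + 0.5)))
        blended.append(out_row)
    return blended


attention = [[1.0, 3.0], [2.0, 3.0]]
heat = resize_nearest(normalize(attention), 4, 4)
frame = [[(100, 100, 100)] * 4 for _ in range(4)]
print(blend_green(frame, heat)[0])
```

Pixels under high attention shift toward green while pixels with zero attention are left unchanged, so the overlay preserves the vehicle's appearance outside the attended regions.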
Figure 4(a) shows an example of the "OOR" case, where only the right-turn signal is flashing. In frame 1, the spatial attention is uniformly distributed over the rear of the vehicle. When the right-turn signal gradually switches off from frame 2 to frame 3, spatial attention starts to concentrate on the region of the right-turn signal and stays there until the last frame. The temporal attention then puts a large weight on frame 4, where the right-turn signal is on. Meanwhile, the spatial attention weight also increases, which implies that the network pays more attention to significant changes of the signal.
Figure 4(b) shows another example, the "BLR" case, where the brake signal is on and both turn signals are flashing. As shown in the figure, the spatial attention focuses on both sides of the vehicle. Both the left- and right-turn signals start to turn on at frame 4, which triggers the network to pay temporal attention to frame 4. The signals do not switch off until frame 7, at which point the temporal attention weight rises to a value higher than that at frame 4. This may be caused by the network learning a cycle of signal flashing for this chunk of the sequence.
IV-D Failure Cases
Figure 5 shows several failure cases in the test set. Most failure cases are due to light reflections that cause a signal that is off to be recognized as on, as shown in Figures 5(a), 5(c), 5(d), and 5(f). A few edge cases, shown in Figures 5(b) and 5(e), are false alarms in which an off signal is treated as a flashing signal. These are caused by the reflection of another vehicle’s flashing light or by a sudden change in sunlight reflection.
V CONCLUSIONS
We propose a method that learns to recognize eight different taillight states from a video sequence. Our method integrates both spatial and temporal attention models into an LSTM network, so as to effectively exploit deep features in both spatial and temporal domains. The experimental results show that the proposed method is able to effectively predict each taillight state in real-world traffic scenes.
REFERENCES
- [1] J. Leonard, J. How, S. Teller, M. Berger, S. Campbell, G. Fiore, L. Fletcher, E. Frazzoli, A. Huang, and S. Karaman, “A perception-driven autonomous urban vehicle,” Journal of Field Robotics, vol. 25, no. 10, pp. 727–774, 2008.
- [2] K.-H. Lee, Y. Kanzawa, M. Derry, and M. R. James, “Multi-target track-to-track fusion based on permutation matrix track association,” in IEEE Intelligent Vehicles Symposium. IEEE, 2018, pp. 465–470.
- [3] C. Badue, R. Guidolini, R. V. Carneiro, P. Azevedo, V. B. Cardoso, A. Forechi, L. F. R. Jesus, R. F. Berriel, T. M. Paixão, F. Mutz, T. Oliveira-Santos, and A. F. D. Souza, “Self-driving cars: A survey,” arXiv preprint arXiv:1901.04407, 2019.
- [4] J.-G. Wang, L. Zhou, Y. Pan, S. Lee, Z. Song, B. S. Han, and V. B. Saputra, “Appearance-based brake-lights recognition using deep learning and vehicle detection,” in IEEE Intelligent Vehicles Symposium, 2016, pp. 815–820.
- [5] H.-K. Hsu, Y.-H. Tsai, X. Mei, K.-H. Lee, N. Nagasaka, D. Prokhorov, and M.-H. Yang, “Learning to tell brake and turn signals in videos using CNN-LSTM structure,” in IEEE International Conference on Intelligent Transportation Systems, 2017.
- [6] P. Thammakaroon and P. Tangamchit, “Predictive brake warning at night using taillight characteristic,” in IEEE International Symposium on Industrial Electronics, 2009, pp. 217–221.
- [7] H.-T. Chen, Y.-C. Wu, and C.-C. Hsu, “Daytime preceding vehicle brake light detection using monocular vision,” IEEE Sensors Journal, vol. 16, no. 1, pp. 120–131, 2016.
- [8] D.-Y. Chen, Y.-J. Peng, L.-C. Chen, and J.-W. Hsieh, “Nighttime turn signal detection by scatter modeling and reflectance-based direction recognition,” IEEE Sensors Journal, vol. 14, no. 7, pp. 2317–2326, 2014.
