Goal-oriented Object Importance Estimation in On-road Driving Videos

Mingfei Gao; Ashish Tawari; Sujitha Martin

arXiv:1905.02848·cs.CV·May 9, 2019

TL;DR

This paper introduces Object Importance Estimation (OIE) for on-road driving videos, combining visual cues and driving goals to identify critical objects influencing vehicle control decisions.

Contribution

It presents a novel framework integrating visual dynamics and driving goals for OIE, along with a real-world dataset and annotations for evaluation.

Findings

01

Goal-oriented method outperforms baselines

02

Significant improvement in turn scenarios

03

Object importance enhances driving control prediction

Abstract

We formulate a new problem as Object Importance Estimation (OIE) in on-road driving videos, where the road users are considered as important objects if they have influence on the control decision of the ego-vehicle's driver. The importance of a road user depends on both its visual dynamics, e.g., appearance, motion and location, in the driving scene and the driving goal, e.g., the planned path, of the ego vehicle. We propose a novel framework that incorporates both visual model and goal representation to conduct OIE. To evaluate our framework, we collect an on-road driving dataset at traffic intersections in the real world and conduct human-labeled annotation of the important objects. Experimental results show that our goal-oriented method outperforms baselines and has much more improvement on the left-turn and right-turn scenarios. Furthermore, we explore the possibility of…

Tables

Table I: Overall statistics of the split parts (P1, P2 and P3). MV and SV denote Mountain View and Sunnyvale.

	P1		P2		P3		Total
Session	S1	S2	S3	S4	S5	S6	NA
Location	MV	SV	MV	SV	MV	SV	NA
Video #	134	100	183	87	188	51	743
Anno. Frame #	2,541		3,087		2,983		8,611
Anno. Obj #	1,164		1,436		1,668		4,268

Table II: Comparison between our Goal-Visual Model and the baselines in terms of Average Precision (%) on turn left (Lt), straight pass (St), turn right (Rt) and all (All) frames. Avg. indicates the average of the corresponding results over P1, P2 and P3.

	P1				P2				P3				Avg.
Test Set	Lt	St	Rt	All	Lt	St	Rt	All	Lt	St	Rt	All	Lt	St	Rt	All
Visual Model-Image	23.5	42.9	16.1	35.5	22.9	42.7	26.7	42.1	19.1	33.9	25.3	32.6	21.8	39.8	22.7	36.7
Visual Model	35.8	71.2	34.7	68.1	56.0	70.6	54.2	68.1	36.4	72.4	57.4	70.9	42.7	71.4	48.8	69.0
Goal-Geometry Model	41.1	32.9	22.8	32.1	32.5	42.6	19.7	40.6	25.6	45.8	30.2	41.8	33.1	40.4	24.2	38.2
Goal-Visual Model	48.9	72.2	42.8	70.2	61.1	71.7	70.3	70.3	45.2	75.8	61.7	72.0	51.7	73.2	58.3	70.8
Random Chance	4.3	5.2	2.7	4.8	5.3	5.9	14.0	8.4	5.7	6.7	4.7	6.1	5.1	5.9	7.1	6.4
UpperBound	90.9	81.6	72.7	81.7	90.9	81.7	90.8	90.8	90.4	89.2	90.9	89.4	90.7	84.1	84.8	87.3

Table III: Comparison between our model and the baselines in terms of mean Average Precision (%) for different object categories. Pn and Ve mean ‘person’ and ‘vehicle’. Avg. indicates the average of the corresponding results over P1, P2 and P3.

	P1			P2			P3			Avg.
	Pn	Ve	mAP	Pn	Ve	mAP	Pn	Ve	mAP	Pn	Ve	mAP
Visual Model-Image	17.7	42.6	30.15	29.6	46.9	38.25	21.2	39.8	30.5	22.8	43.1	33.0
Goal-Geometry Model	34.8	35.3	35.1	36.9	44.4	40.7	45.2	43.6	44.4	40.0	41.1	40.6
Visual Model	56.0	75.4	65.7	56.1	76.4	66.3	49.7	78.6	64.2	53.9	76.8	65.4
Goal-Visual Model	60.0	76.2	68.1	61.2	78.1	69.7	57.6	77.3	67.5	59.6	77.2	68.4

Equations

$$s^{t}_{i}=W(\mathrm{LSTM}(\textbf{GoF}^{t}_{i}))+b. \tag{1}$$

$$R=sign\times\left(\frac{(1+y^{\prime 2})^{\frac{3}{2}}}{y^{\prime\prime}}\right), \tag{2}$$

$$\hat{IR}(l)=\frac{\omega(l)}{v(l)}=\frac{\alpha\times yr(l)}{v(l)}, \tag{3}$$

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Video Surveillance and Tracking Methods · Human Pose and Action Recognition

Full text

Goal-oriented Object Importance Estimation in On-road Driving Videos

Mingfei Gao¹*, Ashish Tawari² and Sujitha Martin²

*Work done during an internship at the Honda Research Institute, USA.
¹The author is with the University of Maryland, College Park, MD, 20740. [email protected]
²The authors are with the Honda Research Institute, Mountain View, CA, 94043. {atawari, smartin}@honda-ri.com

Abstract

We formulate a new problem as Object Importance Estimation (OIE) in on-road driving videos, where the road users are considered as important objects if they have influence on the control decision of the ego-vehicle’s driver. The importance of a road user depends on both its visual dynamics, e.g., appearance, motion and location, in the driving scene and the driving goal, e.g., the planned path, of the ego vehicle. We propose a novel framework that incorporates both visual model and goal representation to conduct OIE. To evaluate our framework, we collect an on-road driving dataset at traffic intersections in the real world and conduct human-labeled annotation of the important objects. Experimental results show that our goal-oriented method outperforms baselines and has much more improvement on the left-turn and right-turn scenarios. Furthermore, we explore the possibility of using object importance for driving control prediction and demonstrate that binary brake prediction can be improved with the information of object importance.

I Introduction

The human visual system plays a key role in perceiving and interacting with traffic participants in complicated driving contexts. When looking at a dynamic scene, a driver can rapidly select the objects that are relevant to the driving task and make a control decision for effective and efficient driving. Inspired by this visual selection mechanism, driver attention has been studied in recent years in order to understand human driving behavior and ultimately help the driving control systems of autonomous vehicles. Existing works focus on pixel-level driver attention prediction by mimicking human gaze behavior [17, 22, 25]. However, using human gaze has at least two drawbacks: 1) human gaze is sometimes not directly related to the driving task, for example, drivers may look at billboards out of personal interest; 2) human gaze is sequential, which makes it impossible to capture all the important information at the same time. Moreover, existing works take only the perceived driving video as input and do not consider the driver’s goal, although the goal is an essential factor in selecting relevant objects. For example, the objects relevant for making control decisions can be very different when the ego vehicle is turning right versus turning left.

To address these limitations, we formulate the problem as Object Importance Estimation (OIE) in on-road driving videos. The important objects are defined as the road users, i.e., vehicles and persons, that are relevant for the ego vehicle’s driver to make the vehicle control decision. Our definition ensures that the important objects are directly related to the driving task and that multiple important objects can be captured at the same time. Static semantic driving context, e.g., traffic lights, lane markings and drivable areas, can also influence driving behavior. However, we focus only on the interactions with road users and leave the static semantic driving context for future work. Fig. 1 shows an example of the scenario that our work focuses on. The visual dynamics of road users are important for our model to understand the driving scene. The driver’s goal (where the vehicle is going) is also essential for object importance estimation. For example, in Fig. 1, if the ego vehicle were turning left instead, the pedestrians on the crosswalk on the right side would not be as important to the ego vehicle.

To solve the proposed OIE problem, we present a novel framework that incorporates both the features of the dynamic road users (visual model) and the driving goal (goal model). To evaluate our framework, we collect an on-road driving dataset in the real world and annotate the important objects given the context. To provide more complex interactions between the road users and the ego vehicle, our dataset focuses on traffic intersections. Experiments show that our method largely outperforms the baselines, especially in scenarios where the ego vehicle is turning left or right, which demonstrates that modeling the driving goal is very important for our task. To explore the possibility of using important objects to improve driving control prediction, we conduct an experiment on binary brake prediction. Results show that binary brake prediction can be improved with the information of object importance.

II Related Works

II-A Driver’s Attention Prediction

Human Gaze based Approach. Existing works focus on driver’s attention prediction supervised by human gaze information [22, 17, 25]. Tawari and Kang propose a Bayesian framework for driver’s attention prediction in [22], where a fully convolutional network is used with only images as input. Palazzi et al. propose a multi-branch model in [17] that incorporates RGB, optical flow and semantic segmentation clips, where C3D [23] is used to extract features from the multiple branches. In [25], Xia et al. propose a driver’s attention framework where a human weighted sampling strategy is used during training to handle critical situations. Kim et al. explore the idea of using driver’s attention to interpret the driving control prediction in [16].

Driver’s Attention Prediction Dataset. Several datasets [20, 24, 7, 18, 1] can be used for driver’s attention prediction, but most of them are either restricted to limited settings or not publicly available. To the best of our knowledge, Dr(eye)ve [1] is the only public on-road driving dataset for the driver’s attention prediction task. It consists of 555,000 frames divided into 74 video sequences. Human gaze is captured by eye tracking glasses and projected onto the corresponding on-road driving video frame. However, it is not suitable for our task, since 1) it has only per-pixel saliency annotations based on human gaze, which cannot be easily converted into important object labels; 2) it contains mostly scenarios of driving on a straight road (the vehicle is mostly trying to keep itself between lines or following another vehicle), which makes it not complicated enough for our task. Driving at traffic intersections is a more appropriate scene for us, since it provides more opportunities for the ego vehicle to interact with other road users.

II-B Region based Object Detector

CNN detectors have achieved great success [12, 11, 19, 13, 10, 21, 8, 28]. Region based CNN (R-CNN) is one of the most popular frameworks. Girshick et al. initially proposed the two-stage R-CNN framework in [12], where object proposals are obtained first and then classified into different categories. Later, Fast R-CNN was proposed in [11] to speed up R-CNN [12] via end-to-end training/testing. However, it relies on external object proposal algorithms. Ren et al. present Faster R-CNN [19], which jointly trains the proposal generation and the detection branches in a single framework. Furthermore, He et al. extend Faster R-CNN in [13] and create a unified architecture for joint detection and instance segmentation. Our problem is related to R-CNN in the sense that we also assign scores to the proposed object candidates. However, we estimate object importance under the driving context rather than differentiating object categories, e.g., dog and cat.

III Problem Formulation

The problem is formulated as goal-oriented object importance estimation, where the inputs are an on-road driving video clip and the goal of the ego vehicle. The outputs are the detected objects with importance scores at the last frame of the video clip. The planned path, which can be obtained from the autonomous driving (AD) path planning module when the vehicle is driving online, is used to represent the goal of the vehicle.

Inspired by the R-CNN frameworks, we propose a two-stage framework that first generates object tracklinks from videos as object proposals and then classifies the proposals into two classes, i.e., important object and background. Unlike R-CNN detectors, which generate proposals from static images, we track every object through the input video clip and treat the entire tracklink of an object as a proposal. The reason is that object categories, e.g., dog and cat, can be determined from a static image alone, whereas object importance depends on the dynamics of objects through the video.

IV Model Description

As mentioned in Sec. I, object importance depends on both the dynamics of the object itself and the driving goal of the ego vehicle. Thus, our method fuses the information from both parts. Because recurrent networks [26, 27, 9] perform well on online action detection tasks, our framework is based on LSTM [15].

Our framework is shown in Fig. 2. The first branch describes our visual model. Multiple object tracking is performed on the input video clip. Thus, for each object candidate $i$, its bounding-box location $B^{t}_{i}$ is obtained at each time step $t$. Note that each time step corresponds to one image frame in the input video clip. For each object candidate at every time step, high-dimensional features $\textbf{f}^{t}_{i}$ are extracted to represent the appearance, motion and location of the object. We use a feature matrix $\textbf{F}^{t}_{i}=[\textbf{f}^{t-n+1}_{i},\textbf{f}^{t-n+2}_{i},...,\textbf{f}^{t}_{i}]$ to represent each object $i$ in the video, where $n$ is the length of the input clip. Without goal information, an LSTM can be applied directly to $\textbf{F}^{t}_{i}$ to output the score $s^{t}_{i}$ of being an important object at time $t$. We use this as a baseline in our experiment section.

The second branch shows our goal model. We extract the goal-oriented feature $\textbf{g}^{t}$ at time $t$ from the AD path planning module. The extracted feature is concatenated with the features of each object in the image to form the final feature representation $\textbf{gof}^{t}_{i}=[\textbf{f}^{t}_{i},\textbf{g}^{t}]$ for the object. The representation of the object over the whole clip is $\textbf{GoF}^{t}_{i}=[\textbf{gof}^{t-n+1}_{i},\textbf{gof}^{t-n+2}_{i},...,\textbf{gof}^{t}_{i}]$. A one-layer LSTM model followed by a fully connected (FC) layer operates on $\textbf{GoF}^{t}_{i}$ to output the importance score for each object $i$ as shown in Eq. 1, where $W$ and $b$ indicate the parameters of the FC layer. A softmax layer is then used to output the corresponding importance probability.

$$s^{t}_{i}=W(\mathrm{LSTM}(\textbf{GoF}^{t}_{i}))+b. \tag{1}$$
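The concatenation that builds $\textbf{GoF}^{t}_{i}$ from the per-frame object features and goal features can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function name `build_goal_object_clip` and the list-of-lists representation are assumptions made for the example, and the LSTM and FC layers of Eq. 1 are left out.

```python
def build_goal_object_clip(object_features, goal_features):
    """Concatenate f_i^t and g^t at every time step to form GoF_i^t.

    object_features: list of n per-frame feature vectors for one object.
    goal_features:   list of n per-frame goal-oriented feature vectors.
    Both names are hypothetical; the paper only defines the concatenation.
    """
    if len(object_features) != len(goal_features):
        raise ValueError("both sequences must cover the same time steps")
    # gof_i^t = [f_i^t, g^t] for each time step t in the clip
    return [list(f) + list(g) for f, g in zip(object_features, goal_features)]


# Toy clip with n = 2 time steps, 2-d object features and 1-d goal features.
clip = build_goal_object_clip([[1.0, 2.0], [3.0, 4.0]], [[9.0], [8.0]])
```

Each row of `clip` would then be one input step of the sequence model.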

Visual Feature. Appearance, motion and location features are combined to represent the dynamic changes of an object. The appearance feature is extracted from the fc7 layer of Faster R-CNN [19] pretrained on the Pascal VOC2007 [5] and VOC2012 [6] trainval sets with ResNet-101 [14] as the backbone. The appearance feature describes both the appearance of the object and the local context around the object [19]. A histogram of flow [4] with BIN=12 is extracted from each object bounding box as the motion feature. The location feature is represented by $(\frac{x^{t}_{i}}{W^{t}},\frac{y^{t}_{i}}{H^{t}},\frac{w^{t}_{i}}{W^{t}},\frac{h^{t}_{i}}{H^{t}})$, where $(x^{t}_{i},y^{t}_{i})$ is the top-left corner of $B^{t}_{i}$ and $w^{t}_{i}$ and $h^{t}_{i}$ are its width and height. $W^{t}$ and $H^{t}$ indicate the width and height of image $t$. The visual feature, $\textbf{f}^{t}_{i}$, is the concatenation of these three features.
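As a small illustration of the location feature described above, the following sketch normalizes a bounding box by the image size. The function name `location_feature` and the example frame size are assumptions made for the example.

```python
def location_feature(box, image_width, image_height):
    """Normalize a bounding box (x, y, w, h) by the image size.

    (x, y) is the top-left corner in pixels; w and h are the box
    width and height. Returns (x/W, y/H, w/W, h/H).
    """
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    x, y, w, h = box
    return (x / image_width, y / image_height,
            w / image_width, h / image_height)


# Hypothetical 1280x720 frame with a 128x72 box whose top-left corner is (640, 360).
feature = location_feature((640, 360, 128, 72), 1280, 720)
```

Normalizing by the image size keeps the four values in a comparable range regardless of the camera resolution.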

Goal-oriented Feature. At each time step, the planned path (with regard to distance in the vehicle-centric coordinates) can be obtained from the AD path planning module for an online driving task. As shown in Fig. 3, at each time step, discrete points are uniformly sampled with respect to distance to represent the planned path. Each sampled point is represented by $(x,y)$, which indicates the location of the point in the vehicle-centric coordinates in the real world. The radius of curvature, $R$, is directly related to the turning behavior, so it can be used to represent each point on the path; it is calculated as in Eq. 2 given the location $(x,y)$. For a straight road, the value of $R$ approaches infinity, which is not appropriate for learning. So, we use $IR=\frac{1}{R}$ instead to describe a point on the planned path. At time $t$, $\textbf{IR}^{t}=[IR(1),IR(2),...,IR(L)]$ is used to represent the whole planned path, where $IR(l)$ indicates the value of $IR$ at the next $l$ distance units and $L$ indicates the maximum future distance our method considers. One FC layer is applied on $\textbf{IR}^{t}$ to extract the goal-oriented feature, $\textbf{g}^{t}$.

$$R=sign\times\left(\frac{(1+y^{\prime 2})^{\frac{3}{2}}}{y^{\prime\prime}}\right), \tag{2}$$

where $y^{\prime}=\frac{dy}{dx}$ and $y^{\prime\prime}=\frac{d^{2}y}{dx^{2}}$. $sign=1$ when turning right and $sign=-1$ when turning left.
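To make Eq. 2 concrete, here is a minimal sketch that evaluates $IR=\frac{1}{R}$ at the middle of three consecutive path samples. It follows Eq. 2 literally, with the $sign$ term passed in by the caller. The central finite differences, the assumption that the samples are uniformly spaced in $x$, and the function name `inverse_radius` are choices made for this example and are not stated in the paper.

```python
def inverse_radius(y_prev, y_curr, y_next, dx, sign):
    """Return IR = 1/R at the middle of three path samples, following Eq. 2.

    y_prev, y_curr, y_next: y coordinates at x - dx, x and x + dx
                            (uniform spacing in x is an assumption).
    sign: +1 when turning right, -1 when turning left, as in the text.
    """
    if dx <= 0:
        raise ValueError("dx must be positive")
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    dy = (y_next - y_prev) / (2.0 * dx)                   # central estimate of y'
    d2y = (y_next - 2.0 * y_curr + y_prev) / (dx * dx)    # central estimate of y''
    # 1/R = y'' / (sign * (1 + y'^2)^(3/2)); 1/sign == sign because sign is +-1.
    return sign * d2y / (1.0 + dy * dy) ** 1.5


# Parabola y = 0.05 * x^2 sampled at x = -1, 0, 1: y' = 0 and y'' = 0.1 at x = 0.
ir_curve = inverse_radius(0.05, 0.0, 0.05, dx=1.0, sign=1)
# Straight path: y'' = 0, so IR is 0 while R itself would be unbounded.
ir_straight = inverse_radius(0.0, 0.0, 0.0, dx=1.0, sign=1)
```

The straight-path case shows why $IR$ is the more convenient quantity: it stays finite where $R$ does not.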

V Experiments

V-A Object Importance Estimation Dataset

Dataset Description. We collect 743 on-road driving videos at traffic intersections in the real world. Data collection was conducted at two different locations, Mountain View and Sunnyvale, CA, USA, totalling 6.3 hours. Each location contains 3 sessions of data. We believe that intersections contain more complicated driving scenarios and are more challenging for our task, so a short video is trimmed from each of the raw videos. Each short video contains one pass of an intersection (25 meters before and after the intersection). After trimming, 2.7 hours of useful data are obtained. All the annotations and our experiments are conducted on the trimmed videos.

Annotations. When preparing the important object annotations, an annotator was asked to watch the on-road driving video and imagine he/she was driving the ego vehicle. All the objects that are relevant for the ego vehicle’s control decision are tightly enclosed in bounding boxes. Note that the annotator was given the driving goal while annotating each video sequence. For each video, important objects are labeled every 30 frames. The frame rate is 30 fps, so labels were acquired once per second.

Furthermore, in order to understand our performance on different driving goals, i.e., turn left, straight pass and turn right, per-frame goals are annotated. The goal of an image frame is annotated as ‘turn left’ if the vehicle is expected to turn left at the next frame, and so on.

Dataset Preprocessing. Important object labeling may be influenced by traffic signals. For example, when the red light is on, no objects are considered important, since none of them will influence the driver’s control decision. Because we only consider the interactions with road users, we remove all the image frames where no important objects are labeled because of the traffic signals.

Dataset Statistics. After preprocessing, $8,611$ image frames are annotated, where $4,268$ important objects are obtained. Among all the labeled frames, $56.6\%$ contain no important objects, $38.3\%$ contain one important object and $5.1\%$ include multiple important objects.

The annotated frame numbers of turn left, straight pass and turn right are $1004$ , $6591$ and $1016$ . The corresponding object numbers are $375$ , $3573$ and $320$ . Although we focus on traffic intersections, there are still more straight-pass frames than left/right-turn ones, which motivates us to evaluate the models based on different goals in order to avoid the results being dominated by the straight-pass scenario.

Train/test sets and statistics. The dataset with 6 sessions is grouped into three parts, i.e., P1, P2 and P3 (see footnote 1). For cross validation, every model is evaluated on each part while trained on the other two parts. We ensure that the data of each part were collected from different sessions, locations and times, and have a similar number of videos and similar category distributions of road users (see footnote 2). Tab. I and Fig. 4 show characteristics of each part. As shown, different parts have very similar statistics.

Footnote 1: We use 3-fold cross validation instead of 10-fold because there is not enough data.
Footnote 2: Since we do not have object-category annotations, we use the results of object detection (with a confidence threshold of $0.5$) to estimate the numbers of vehicles and persons in the annotated frames.
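The 3-fold protocol described above can be sketched as follows. The session-to-part mapping is read from Tab. I (P1 holds S1 and S2, P2 holds S3 and S4, P3 holds S5 and S6). The dictionary layout and the function name `cross_validation_folds` are assumptions made for this example.

```python
# Session-to-part assignment taken from Tab. I.
PARTS = {"P1": ["S1", "S2"], "P2": ["S3", "S4"], "P3": ["S5", "S6"]}


def cross_validation_folds(parts):
    """Yield one (train_sessions, test_sessions) pair per held-out part."""
    folds = []
    for test_name in sorted(parts):
        # Train on every session that does not belong to the held-out part.
        train_sessions = [session
                          for name in sorted(parts) if name != test_name
                          for session in parts[name]]
        folds.append((train_sessions, list(parts[test_name])))
    return folds


folds = cross_validation_folds(PARTS)
```

Each model would be trained three times, once per fold, and the reported "Avg." columns would correspond to averaging the three test results.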

V-B Planned Path Approximation

Since the experiments are done in an off-line manner, data from the AD path planning module is not available. To evaluate our method, we recover (approximate) the planned path of our vehicle at a given time step as $\textbf{IR}^{t}\approx\hat{\textbf{IR}}^{t}=[\hat{IR}(1),\hat{IR}(2),...,\hat{IR}(L)]$, where $\hat{IR}(l)$ is calculated as in Eq. 3. We believe that it is easy to replace $\hat{\textbf{IR}}$ with $\textbf{IR}$ when an AD path planning module is available.

$$\hat{IR}(l)=\frac{\omega(l)}{v(l)}=\frac{\alpha\times yr(l)}{v(l)}, \tag{3}$$

where $\omega(l)$, $v(l)$ and $yr(l)$ indicate the angular velocity, velocity (kilometers per hour) and yaw rate (angle per second) at the next $l$ distance units. One distance unit is $\frac{1}{3.6}$ meters. $\alpha$ is a scale factor.

Both yaw rate and velocity can be obtained from the CAN bus sensors. Yaw rate values are negative when turning left while positive when turning right.
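A minimal sketch of Eq. 3 follows, assuming the yaw rate and velocity have already been resampled to one value per future distance unit. The function name `approximate_inverse_radius` and the `min_speed` guard against division by zero are assumptions added for the example; the paper does not say how near-zero velocities are handled.

```python
def approximate_inverse_radius(yaw_rates, velocities, alpha=1.0, min_speed=1e-3):
    """Approximate the planned path as IR_hat(l) = alpha * yr(l) / v(l) (Eq. 3).

    yaw_rates, velocities: one value per future distance unit l = 1..L.
    min_speed: hypothetical guard; entries with |v| below it are set to 0.0.
    """
    if len(yaw_rates) != len(velocities):
        raise ValueError("yaw_rates and velocities must have the same length")
    path = []
    for yaw_rate, speed in zip(yaw_rates, velocities):
        if abs(speed) < min_speed:
            path.append(0.0)  # assumption: treat a stopped vehicle as "straight"
        else:
            path.append(alpha * yaw_rate / speed)
    return path


# Negative yaw rate (left turn), zero (straight) and positive (right turn).
ir_hat = approximate_inverse_radius([-10.0, 0.0, 10.0], [20.0, 20.0, 20.0])
```

With the sign convention above, left turns give negative values and right turns give positive values.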

Examples of $\hat{IR}(l)$ for left turn, straight and right turn are shown in Fig. 5. The three driving goals show clearly discriminative patterns, e.g., left turns have negative troughs, right turns have positive crests and straight passes stay around zero.

V-C Baselines

Upperbound. We estimate importance scores for all the object proposals (tracklinks), so the final results depend on the quality of the detection and tracking algorithm. In this baseline, we assign the correct importance label to each proposal link. Thus, it is the upper bound of our method, and all the remaining mistakes are due to poor detection and tracking.

Random Chance. In this baseline, we randomly assign a value ($\in[0,1]$) to each proposed tracklink as its importance probability. It serves as the lower bound of our method.

Visual model. It contains only the first branch of our framework which has only the visual features as input to the LSTM model. We want to see how the goal information can improve the prediction results quantitatively.

Visual model-Image. This model does not utilize the temporal information and predicts object importance scores by just observing the target image frame. In order to do that, we replace the LSTM model with one FC layer. This baseline is to compare with the standard object detection framework and evaluate how much the temporal information can help.

Goal-Geometry Model. This baseline has the same two-branch structure as our method, except that the appearance feature is removed and only motion and location features are used. Comparing it with our method shows whether the method performs well when the semantic local context is not given.

V-D Implementation Details

A tracking-by-detection [2] framework is used to conduct object tracking, where Faster R-CNN [19] with ResNet-101 is used for detection and SORT [3] is used for tracking. Some objects may not appear in the first frame or may not last until the end of the clip. We keep only the objects that still exist at the last frame and pad zeros at the front if they do not start at the first frame.
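The front padding of short tracks can be sketched as below. The padding value is assumed to be zero, and the function name `pad_track_front` and the list-based feature layout are hypothetical choices for this example.

```python
def pad_track_front(features, clip_length, feature_dim):
    """Left-pad a track's per-frame features with zero vectors.

    features: per-frame feature vectors for the frames where the object
              exists, ending at the last frame of the clip.
    Returns a list of exactly clip_length vectors.
    """
    if len(features) > clip_length:
        raise ValueError("track is longer than the clip")
    missing = clip_length - len(features)
    # Assumption: the padding value is 0 for every feature dimension.
    padding = [[0.0] * feature_dim for _ in range(missing)]
    return padding + [list(f) for f in features]


# A track that only appears in the last frame of a 3-frame clip.
padded = pad_track_front([[1.0, 2.0]], clip_length=3, feature_dim=2)
```

Padding at the front keeps the most recent observation aligned with the last time step, which is the frame where the importance score is read out.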

The length of the video clip, $n$, is set to $30$. We set $L=40$, which is roughly 10 meters in the real world. $\alpha$ in Eq. 3 is set to 1. For the visual model, we set the size of the LSTM hidden state to $256$, and the FC layer in the goal model has $16$ units. For the image based visual model, the FC layer has $1,024$ units. A weighted cross-entropy loss is used to optimize our model and all the baselines. The weights for positive and negative samples are inversely proportional to their sample numbers in one training batch.
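The batch-wise class weighting can be sketched as follows. The paper only states that the weights are inversely proportional to the per-batch sample counts; normalizing the two weights to sum to one, falling back to equal weights when a class is absent, and the function names are assumptions made for this example.

```python
import math


def class_weights(labels):
    """Return (w_pos, w_neg), inversely proportional to the class counts."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return (1.0, 1.0)  # assumption: no reweighting for a one-class batch
    total = n_pos + n_neg
    # w_pos / w_neg == n_neg / n_pos, normalized so that w_pos + w_neg == 1.
    return (n_neg / total, n_pos / total)


def weighted_cross_entropy(probs, labels, eps=1e-12):
    """Mean weighted binary cross-entropy over one batch."""
    if not labels or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equally long")
    w_pos, w_neg = class_weights(labels)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= w_pos * math.log(p) if y == 1 else w_neg * math.log(1.0 - p)
    return total / len(labels)


# One important object among four proposals.
weights = class_weights([1, 0, 0, 0])
loss = weighted_cross_entropy([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 0])
```

In this toy batch the single positive sample contributes as much to the loss as the three negatives combined, which is the intended effect when most proposals are background.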

V-E Experimental Results

Comparisons between our method, i.e., Goal-Visual Model, and the baselines using average precision (AP) are shown in Tab. II. Our method largely outperforms Random Chance (the “by-chance” approach). Comparing Visual Model with Visual Model-Image, we see that the temporal information is essential for our task. Without temporal modelling, the overall average AP drops by $32.3$ percentage points (from $69.0\%$ to $36.7\%$). With the goal information, our Goal-Visual Model outperforms the Visual Model by about $2$ percentage points in terms of AP ($70.8\%$ versus $69.0\%$).
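For readers who want to reproduce this kind of comparison, the sketch below computes average precision from per-proposal scores and binary importance labels. The paper does not spell out which AP variant it uses, so this assumes the non-interpolated definition (mean of the precision values at the rank of each positive). The function name is hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of each positive."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        return 0.0  # assumption: AP is reported as 0 when there are no positives
    # Rank proposals from the highest to the lowest importance score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if labels[index] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_pos


# Positives ranked first and third: AP = (1/1 + 2/3) / 2.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

Because AP depends only on the ranking, a random scorer lands near the fraction of positive proposals, which is consistent with the low Random Chance numbers in Tab. II.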

To evaluate the effectiveness of local visual scene context, our method is compared with the Goal-Geometry Model. The Goal-Geometry Model captures only the motion and location information of a road user and combines it with the goal of the ego vehicle, without knowing the scene semantics. As shown in Tab. II, our method largely outperforms this baseline, which demonstrates the usefulness of the scene context.

To evaluate our performance on different driving goals, we validate our method and the baselines on turn left, straight pass and turn right frames separately. Intuitively, our goal model should help more in the turn left and turn right cases than in the straight pass case. From the average results in Tab. II, our method improves on the Visual Model by $9$ percentage points AP for turn left ($42.7\%$ to $51.7\%$) and by $9.5$ points for turn right ($48.8\%$ to $58.3\%$).

We are also interested in our performance on different object categories, i.e., person and vehicle. Since we do not have ground truth for the object categories, we generate the class labels using the detection results. We match each labeled important object to the detected object with which it has the largest Intersection over Union (IoU), provided that $IoU>0.5$. It is not guaranteed that every important object will find a match, since the detector is not perfect. However, experiments show that around $95\%$ of important objects are matched, so we ignore the small number of unmatched ones. Comparisons between our method and the baselines are shown in Tab. III, which demonstrates that our model outperforms all the baselines in terms of mAP. Specifically, we observe that performance on the ‘person’ category is largely improved with goal information. The Goal-Visual Model improves by around $6$ percentage points on ‘person’ compared to the Visual Model ($53.9\%$ to $59.6\%$ on average). This may be due to the fact that most important persons are those who are walking across the road. It is essential for the model to know where the ego vehicle is going in order to infer whether a pedestrian on a certain side is important.
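The label-to-detection matching described above can be sketched as follows. Boxes are assumed to be given as corner coordinates `(x1, y1, x2, y2)`, and the function names `iou` and `match_label` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_label(labeled_box, detections, threshold=0.5):
    """Return the index of the detection with the largest IoU, or None
    if that IoU does not exceed the threshold."""
    best_index, best_iou = None, 0.0
    for index, detection in enumerate(detections):
        overlap = iou(labeled_box, detection)
        if overlap > best_iou:
            best_index, best_iou = index, overlap
    return best_index if best_iou > threshold else None


label = (0, 0, 10, 10)
detections = [(20, 20, 30, 30), (1, 1, 10, 10), (0, 0, 10, 5)]
matched = match_label(label, detections)
```

The matched detection then supplies the ‘person’ or ‘vehicle’ class for the labeled important object; labels that return `None` are the unmatched ones that the text ignores.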

Qualitative results on turn left and turn right are shown in Fig. 6. As shown, knowing the driving goal can help capture important objects on (or coming to) our future path, e.g., turn left (a)(c)(d) and turn right (d). It can also filter out objects that cannot block our way given their motion and location, e.g., turn left (b) and turn right (a)(b)(c).

Three major failure cases are shown in Fig. 7. The first is caused by poor detection/tracking results. When the detection of an important object fails, there is no way for our framework to correct it. That is why our upper bound is not $100\%$ AP. The second case results from missing global scene context. The comparison shows that, of the two parked cars, one is considered important but the other is not. Based on our observation, the annotator tends to annotate a parked car if the road is narrow. The third case is due to the lack of communication among road users. For example, if we removed the labeled car in the last image, all the pedestrians would be important. They are not labeled as important because a closer car stops the ego vehicle from hitting them. Since our method does not model the interactions among road users, it is hard for an object to know the status of other objects. Future work is needed to address these three failure cases.

V-F Are Road Users Equally Important?

As a proof of concept, we propose a binary brake prediction (BBP) framework with object importance as an input.

BBP is a simplified version of the brake prediction task that uses binary labels, $y_{brake}$, instead of continuous brake values, $v_{brake}$, which can be obtained from CAN bus data ($y_{brake}=1$ if $v_{brake}>0$ and $y_{brake}=0$ otherwise). The input of BBP is a video clip and the output is the brake probability of the ego vehicle at the last frame.

We assume that brakes depend only on the interaction between the road users and the ego vehicle, since we have removed the traffic-light related frames from our dataset. The visual model in Fig. 2 is used to predict the brake score, $\textit{s}^{t}_{i}$, of the ego vehicle at time $t$ given road user $i$ in the input video clip. The final brake score, $\textit{s}^{t}_{fuse}=\underset{i}{\sum}{(w^{t}_{i}*\textit{s}^{t}_{i})}$, is obtained by fusing the predicted scores of all the road users as a weighted sum. Our model uses the predicted importance probability as the weight of each object. Our intuition is that more important objects have a bigger impact on the brake decision. The baseline uses the same weight ($0.5$) for all the objects to indicate that all objects in the scene contribute equally to the brake decision.
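The binary label and the weighted-sum fusion can be illustrated with a short sketch. The function names and the toy scores are assumptions made for the example; in the paper the per-object scores come from the visual model and the weights from the predicted importance probabilities.

```python
def binary_brake_label(brake_value):
    """y_brake = 1 if the continuous brake value is positive, else 0."""
    return 1 if brake_value > 0 else 0


def fuse_brake_scores(per_object_scores, weights):
    """s_fuse = sum_i w_i * s_i over all road users in the clip."""
    if len(per_object_scores) != len(weights):
        raise ValueError("one weight is required per road user")
    return sum(w * s for w, s in zip(weights, per_object_scores))


# Two road users with hypothetical per-object brake scores.
scores = [0.8, 0.2]
# Importance-weighted fusion: the first object is predicted to be important.
fused_importance = fuse_brake_scores(scores, [0.9, 0.1])
# Baseline fusion: every object gets the same weight of 0.5.
fused_uniform = fuse_brake_scores(scores, [0.5, 0.5])
```

In this toy case the importance weights move the fused score toward the score of the object that matters, while the uniform baseline averages both objects.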

Experimental results suggest that our method improves on the baseline by $4.3\%$, $1.7\%$ and $1.3\%$ AP on P1, P2 and P3, respectively, which demonstrates the potential usefulness of object importance.

VI Conclusion

We propose a new problem as Object Importance Estimation (OIE) in on-road driving videos to understand the human visual selection mechanism under the driving context. We present a novel framework to handle the problem where both the visual dynamics of road users and the goal of the ego vehicle are taken into consideration. To evaluate the problem, we collect an on-road driving dataset and annotate the important objects given the video clip. Experimental results demonstrate the effectiveness of our idea. Moreover, we explore the potential usage of the OIE by incorporating it into a binary brake prediction framework. Experiments show that important objects can help to improve the prediction.

Bibliography

[1] S. Alletto, A. Palazzi, F. Solera, S. Calderara, and R. Cucchiara. Dr(eye)ve: a dataset for attention-based tasks with applications to autonomous and assisted driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 54–60, 2016.
[2] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[3] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 3464–3468. IEEE, 2016.
[4] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision, pages 428–441. Springer, 2006.
[5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC 2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[7] L. Fridman, P. Langhans, J. Lee, and B. Reimer. Driver gaze region estimation without use of eye movement. IEEE Intelligent Systems, 31(3):49–56, 2016.
[8] M. Gao, A. Li, R. Yu, V. I. Morariu, and L. S. Davis. C-WSL: Count-guided weakly supervised localization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 152–168, 2018.