FARSA: Fully Automated Roadway Safety Assessment

Weilian Song; Scott Workman; Armin Hadzic; Xu Zhang; Eric Green; Mei Chen; Reginald Souleyrette; Nathan Jacobs

arXiv:1901.06013 · cs.CV · January 21, 2019


TL;DR

This paper introduces FARSA, a deep learning system that automates roadway safety ratings from street-level images, significantly reducing manual effort and improving accuracy in safety assessments.

Contribution

We developed a deep neural network with task-specific attention for automated, rapid, and multi-attribute roadway safety assessment from panoramic images.

Findings

01 Semi-supervised training reduces overfitting.

02 Multi-task learning improves rating accuracy.

03 Fast inference per image in milliseconds.


Tables

Table 1: Parameter settings for each method.

Method	$λ_{s^{1}}$	$λ_{s^{2}}$	$λ_{s}$	$λ_{m}$	$λ_{u}$
Baseline	1	$10^{2}$	1	0	0
M1	1	$10^{2}$	1	0	0
M2	1	$10^{2}$	0.1	1	0
M3	1	$10^{2}$	1	0	0.001
M4	1	$10^{2}$	0.1	1	0
Ours	1	$10^{2}$	0.1	1	0.001

Table 2: Top-1 accuracy for each method.

Method	Attn.	Multi.	Unsuper.	Acc.
Baseline				43.06
M1	X			43.28
M2		X		43.56
M3	X		X	43.63
M4	X	X		45.68
Ours	X	X	X	46.91

Table 3: Multi-task evaluation for our architecture. R. = roadside, P. = passenger, and D. = driver.

Label type	Top-1	Random	% inc.
Area type	97.40	50.39	47.01
Lane width	77.22	33.25	43.97
Curve quality	63.57	32.78	30.79
P. side land use	54.88	28.38	26.50
D. side land use	52.72	28.32	24.40
D. side sidewalk	77.17	57.00	20.17
Vehicle parking	52.39	33.06	19.33
Road condition	51.38	33.35	18.03
P. side sidewalk	60.34	43.54	16.80
Intersection quality	83.06	66.69	16.37
Intersection road volume	28.21	14.18	14.03
D. side paved shoulder	38.61	24.71	13.90
P. side paved shoulder	37.56	24.42	13.14
Number of lanes	45.78	33.13	12.65
R. D. side distance	37.51	25.47	12.04
R. P. side distance	36.45	24.98	11.47
Median type	57.30	46.86	10.44
R. P. side object	38.89	29.59	9.30
Upgrade cost	42.82	33.93	8.89
R. D. side object	49.24	41.15	8.09
Intersect channel	56.25	49.88	6.37
Bicycle facilities	75.19	71.51	3.68
Curvature	25.83	25.52	0.31

Equations

$$L = \lambda_{s} L_{s} + \lambda_{m} L_{m} + \lambda_{u} L_{u}$$

$$L_{s^{1}} = -\frac{1}{N} \sum_{i=1}^{N} y_{i}(l_{i}) \log \hat{y}_{i}(l_{i})$$

$$L_{s^{2}} = \frac{1}{N} \sum_{i=1}^{N} \left\| F(\hat{y}_{i}) - F(y_{i}) \right\|_{2}^{2}$$

$$L_{s} = \lambda_{s^{1}} L_{s^{1}} + \lambda_{s^{2}} L_{s^{2}}$$

$$L_{u} = \frac{1}{|U|} \sum_{(a,b) \in U} \left\| F(\hat{y}_{a}) - F(\hat{y}_{b}) \right\|_{2}^{2}$$

$$w_{l} = \frac{N}{K \cdot \mathrm{count}(l)}$$

$$L_{s^{1}} = -\frac{1}{N} \sum_{i=1}^{N} w_{l_{i}}\, y_{i}(l_{i}) \log \hat{y}_{i}(l_{i})$$

$$L_{s^{2}} = \frac{1}{N} \sum_{i=1}^{N} w_{l_{i}} \left\| F(\hat{y}_{i}) - F(y_{i}) \right\|_{2}^{2}$$


Code & Models

Repositories

arminHadzic/Panorama_Valhalla


Taxonomy

Topics: Infrastructure Maintenance and Monitoring · Remote Sensing and LiDAR Applications · Automated Road and Building Extraction

Full text

FARSA: Fully Automated Roadway Safety Assessment

Weilian Song¹, Scott Workman¹, Armin Hadzic¹, Xu Zhang²,³, Eric Green²,³, Mei Chen²,³, Reginald Souleyrette²,³, Nathan Jacobs¹

¹ Department of Computer Science, University of Kentucky

² Department of Civil Engineering, University of Kentucky

³ Kentucky Transportation Center, University of Kentucky

Abstract

This paper addresses the task of road safety assessment. An emerging approach for conducting such assessments in the United States is through the US Road Assessment Program (usRAP), which rates roads from highest risk (1 star) to lowest (5 stars). Obtaining these ratings requires manual, fine-grained labeling of roadway features in street-level panoramas, a slow and costly process. We propose to automate this process using a deep convolutional neural network that directly estimates the star rating from a street-level panorama, requiring milliseconds per image at test time. Our network also estimates many other road-level attributes, including curvature, roadside hazards, and the type of median. To support this, we incorporate task-specific attention layers so the network can focus on the panorama regions that are most useful for a particular task. We evaluated our approach on a large dataset of real-world images from two US states. We found that incorporating additional tasks, and using a semi-supervised training approach, significantly reduced overfitting problems, allowed us to optimize more layers of the network, and resulted in higher accuracy.

1 Introduction

At over 35,000 fatalities annually, highway crashes are one of the primary causes of accidental deaths in the US [30]. Data-driven approaches to highway safety have been widely used to target high-risk road segments and intersections through various programs, leading to a reduction in the number of fatalities observed [29]. Unfortunately, this reduction has plateaued in recent years. Automated vehicles will likely have a dramatic effect on further reducing fatalities, perhaps by as much as 90% [6]. However, until this technology is mature, policies are in place, and the public adopts machine-driven vehicles, we must rely on other methods to improve safety. In the US and other developed countries, most high crash-rate locations have already been identified and addressed. State and local highway authorities are now turning increasingly to systemic analysis to further drive down the occurrence of highway crashes, injuries, and fatalities. Systemic analysis requires data to identify and assess roadway features and conditions known to increase crash risk.

We address the task of automatically assessing the safety of roadways for drivers. Such ratings can be used by highway authorities to decide, using cost/benefit analysis, where to invest in infrastructure improvements. Automating this typically manual task will enable much more rapid assessment of safety problems of large regions, entire urban areas, counties, and states. In the end, this can save lives and reduce injuries.

An emerging technique for systemically addressing the highway crash problem is represented by the protocols of the United States Road Assessment Program (usRAP). In this process, a trained coder annotates, at regular intervals, various features of the roadway, such as roadway width, shoulder characteristics, roadside hazards, and the presence of road protection devices such as guardrails. These annotations are based on either direct observation or imagery captured in the field, and can be used to augment any existing highway inventory of these features. These data are then used to rate the roadway, and in turn, these ratings may be classified into a 5-tier star-rating scale, from highest risk (1 star) to lowest (5 stars). This manual process is laborious, time-consuming, and sometimes cost-prohibitive. Moreover, the speed and accuracy of the rating process vary across coders and over time.

To automate this manual process, we propose a deep convolutional neural network (CNN) architecture that directly estimates the star rating from a ground-level panorama. See Figure 2 for examples of input panoramas. Note the lack of a physical median, paved shoulder, and sidewalks in the less safe roads. The key features of our approach are:

• multi-task learning: we find that augmenting the network to support estimating lower-level roadway features improves performance on the main task of estimating the star rating;

• task-dependent spatial attention layer: this allows the network to focus on particular regions of the panorama that are most useful for a particular task; and

• unsupervised learning: we add a loss that encourages the star rating distribution to be similar for nearby panoramas, greatly expanding the number of panoramas seen by the network, without requiring any manual annotation.

We evaluate this approach on a large dataset of real-world imagery and find that it outperforms several natural baselines. Lastly, we present an ablation study that demonstrates the benefits of the various components of our approach.

2 Related Work

Our work builds upon work in several areas: general purpose scene understanding, automatic methods for understanding urban areas, and current practice in the assessment of roadway features.

Scene Classification and Image Segmentation

Over the past ten years, the state of the art in scene classification has dramatically improved. Today, most, if not all, of the best performing methods are instances of deep convolutional neural networks (CNNs). For many tasks, these methods can estimate the probability distribution across hundreds of classes in milliseconds at human-level accuracy [8]. A notable example is the Places CNN [33], developed by Zhou et al., which adapts a network that was created for image classification [12]. This network, or similar networks, has been adapted for a variety of tasks, including: horizon line estimation [31], focal length estimation [26], geolocalization [15], and a variety of geo-informative attributes [13].

This ability to adapt methods developed for scene classification and image segmentation to other tasks was one of the main motivations for our work. However, we found that naïvely applying these techniques to the task of panorama understanding did not work well. The main problem is that these methods normally use lower-resolution imagery, which means they cannot identify small roadway features that have a significant impact on the assessment of safety. We propose a CNN architecture that overcomes this problem by incorporating a spatial attention mechanism to support the extraction of small image features.

In some ways, what we propose is similar to the task of semantic segmentation [3], which focuses on estimating properties of individual pixels. The current best methods are all CNN-based and require a densely annotated training set. Constructing such datasets is a labor-intensive process. Fortunately, unlike semantic segmentation, we are estimating a high-level attribute of the entire scene, so the effort required to construct an annotated dataset is lower. It also means that our CNNs can have a different structure: we extract features at a coarser resolution and have far fewer loss computations. This means faster training and inference.

Urban Perception and High-Definition Mapping

Recently there has been a surge of interest in applying techniques for scene classification [2, 4, 17, 19, 22, 27, 28] and image segmentation to understanding urban areas and transportation infrastructure [5, 16, 25]. The former focuses on higher-level labels, such as perceived safety, population density, or beauty, while the latter focuses on finer-grained labels, such as the location of line markings or the presence of sidewalks. Our work, to some extent, combines both of these ideas. However, we focus on estimating the higher-level labels and use finer-grained labels as a form of multi-task learning. Our evaluation demonstrates that combining these in a single network yields better results.

Current Practice in Roadway Assessment

The Highway Safety Manual (HSM) [10] outlines strategies for conducting quantitative safety analysis, including a predictive method for estimating crash frequency and severity from traffic volume and roadway characteristics. The usRAP Star Rating Protocol is an internationally established methodology for assessing road safety, and is used to assign road protection scores resulting in five-star ratings of road segments. A star rating is partly determined by the presence of approximately 60 road safety features [7]. The more road safety features are implemented, the higher the safety rating, and vice versa. There are separate ratings for vehicle occupants, cyclists, and pedestrians, but we focus on vehicle occupant star ratings in this paper.

3 Approach

We propose a CNN architecture for automatic road safety assessment. We optimize the parameters of this model by minimizing a loss function that combines supervised, multi-task, and unsupervised component losses. We begin by outlining our base architecture, which we use in computing all component loss functions.

3.1 Convolutional Neural Network Architecture

Our base CNN architecture takes as input a street-level, equirectangular panorama (e.g., from Google Street View) and outputs a categorical distribution over a discrete label space. Our focus is on the roadway safety label space, which is defined by usRAP to have five tiers. Other label spaces will be defined in the following section. In all experiments, panoramas are cropped vertically, to reduce the number of distorted sky and ground pixels, and then resized to $224\times 960$.

The CNN consists of a portion of the VGG-16 architecture [24], followed by a $1\times 1$ convolution, a spatial attention layer, and a final fully connected layer. See Figure 3 for a visual overview of the full architecture and the various ways we use it to train our model. We use VGG-16 for low-level feature extraction. It consists of 5 convolutional blocks followed by 3 fully connected layers, totaling 16 individual layers. We remove the fully connected layers and use the output of the last convolutional block, after spatial pooling. We denote this output feature map as $S_{1}$, which is a $7\times 30$ tensor with 512 channels. $S_{1}$ is then passed to a convolutional layer (ReLU activation function) with kernels of size $1\times 1$, a stride of 1, and 512 output channels. The resulting tensor, $S_{2}$, has shape $7\times 30\times 512$. This gives us a set of mid-level features that we use to predict the safety of a given panorama.

As the target safety rating may depend on where in the image a particular roadway feature is present, we introduce an attention layer to fuse the mid-level features. Specifically, we use a learnable vector $\boldsymbol{a}$ to compute a weighted average of $S_{2}$ over its two spatial dimensions: we flatten the spatial dimensions of $S_{2}$ (giving a $210\times 512$ matrix, since $7\times 30=210$) and multiply it by $\boldsymbol{a}$ ($1\times 210$). This process is akin to global average pooling [14], but with location-specific weights. The output ($1\times 512$) is then passed to a task-dependent fully connected layer with $K$ outputs.
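The attention pooling step can be illustrated with a minimal pure-Python sketch. It assumes the attention vector is parameterized by one learnable score per spatial location and normalized with a softmax (the implementation details later state that attention mechanisms apply a softmax); the function names and toy sizes are hypothetical and stand in for the $210\times 512$ feature map.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, attention_logits):
    """Fuse a flattened feature map into one feature vector.

    features: list of P rows (one per spatial location), each with C channels.
    attention_logits: P scores, one per location (assumed parameterization);
    the softmax turns them into weights that sum to 1.
    """
    weights = softmax(attention_logits)
    channels = len(features[0])
    return [
        sum(w * row[c] for w, row in zip(weights, features))
        for c in range(channels)
    ]

# Toy sizes: 4 locations and 3 channels stand in for 210 and 512.
features = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0], [5.0, 2.0, 2.0], [7.0, 3.0, 0.0]]

# Equal scores reduce to global average pooling.
uniform = attention_pool(features, [0.0, 0.0, 0.0, 0.0])

# A dominant score makes the output follow a single location.
focused = attention_pool(features, [0.0, 0.0, 0.0, 50.0])
```

With equal scores the result matches global average pooling, which corresponds to the Baseline variant in the ablation study; unequal scores let each task emphasize different panorama regions.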

3.2 Loss Function

A key challenge in training large neural network architectures, especially with small datasets, is avoiding overfitting. For this work, we propose to use a combination of supervised, multi-task, and unsupervised learning to train our model. The end result is a total loss function:

$$L = \lambda_{s} L_{s} + \lambda_{m} L_{m} + \lambda_{u} L_{u}.$$

Each component loss function processes panoramas using the base architecture defined in the previous section. In all cases, the parameters of the VGG sub-network and the subsequent $1\times 1$ convolution are tied. The attention and final fully connected layers are independent and, therefore, task specific. The remainder of this section describes the various components of this loss in detail.

3.2.1 Supervised Loss

The first component, $L_{s}$, of our total loss, $L$, corresponds to the main goal of our project: estimating the safety of a given roadway from an input panorama. Each panorama is associated with a star rating label, $l\in\{1,\ldots,5\}$. We apply the network defined in Section 3.1 with $K=5$ outputs, representing a categorical distribution over star ratings. Our first loss, $L_{s}$, incorporates both a classification and a regression component. For classification, we use the standard cross-entropy loss between the predicted distribution, $\hat{y}$, and the target distribution, $y$:

$$L_{s^{1}} = -\frac{1}{N} \sum_{i=1}^{N} y_{i}(l_{i}) \log \hat{y}_{i}(l_{i})$$

where $N$ is the number of training examples. For regression, we use the Cramer distance between $\hat{y}$ and $y$:

$$L_{s^{2}} = \frac{1}{N} \sum_{i=1}^{N} \left\| F(\hat{y}_{i}) - F(y_{i}) \right\|_{2}^{2}$$

where $F(x)$ is the cumulative distribution function of $x$. Each component is weighted by $\lambda_{s^{1}}$ and $\lambda_{s^{2}}$, respectively:

$$L_{s} = \lambda_{s^{1}} L_{s^{1}} + \lambda_{s^{2}} L_{s^{2}}.$$
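The following pure-Python sketch shows one way the supervised loss could be computed, assuming one-hot target distributions and the weights $\lambda_{s^{1}}=1$ and $\lambda_{s^{2}}=100$ reported in the implementation details. The function names are hypothetical. The example contrasts two predictions that assign the same probability to the true 3-star label: the cross-entropy term cannot tell them apart, while the Cramer term penalizes the one that spreads its remaining mass over distant ratings.

```python
import math

def cdf(dist):
    """Cumulative distribution function of a categorical distribution."""
    running, out = 0.0, []
    for p in dist:
        running += p
        out.append(running)
    return out

def cramer_distance(pred, target):
    """Squared L2 distance between the CDFs of two distributions."""
    return sum((fp - ft) ** 2 for fp, ft in zip(cdf(pred), cdf(target)))

def supervised_loss(preds, labels, lam_s1=1.0, lam_s2=100.0, num_classes=5):
    """L_s = lam_s1 * cross-entropy + lam_s2 * Cramer distance (means over N).

    preds: list of predicted distributions; labels: star ratings in 1..K.
    Targets are assumed to be one-hot distributions.
    """
    n = len(preds)
    cross_entropy, cramer = 0.0, 0.0
    for pred, label in zip(preds, labels):
        target = [1.0 if k == label else 0.0 for k in range(1, num_classes + 1)]
        cross_entropy -= math.log(pred[label - 1])
        cramer += cramer_distance(pred, target)
    return lam_s1 * cross_entropy / n + lam_s2 * cramer / n

# Both predictions give the true label (3 stars) probability 0.6.
near = [0.0, 0.2, 0.6, 0.2, 0.0]   # remaining mass on adjacent ratings
far = [0.2, 0.0, 0.6, 0.0, 0.2]    # remaining mass on the extremes

loss_near = supervised_loss([near], [3])
loss_far = supervised_loss([far], [3])   # larger, due to the Cramer term
```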

3.2.2 Multi-task Loss

The second component, $L_{m}$, of our total loss, $L$, represents a set of auxiliary tasks. We selected $M$ auxiliary tasks with discrete label spaces, specifically: area type, intersection channelization, curvature, upgrade cost, land use (driver/passenger-side), median type, roadside severity (driver/passenger-side distance/object), paved shoulder (driver/passenger-side), intersecting road volume, intersection quality, number of lanes, lane width, quality of curve, road condition, vehicle parking, sidewalk (passenger/driver-side), and facilities for bicycles. All images were annotated by a trained coder as part of the usRAP protocol.

For each new task, the prediction process is very similar to the safety rating task, with its own attention mechanism and final prediction layer. The only difference is the output size of the prediction layer, which varies to match the label space of the specific task. To compute $L_{m}$, we sum the cross-entropy loss across all tasks: $L_{m}=\sum^{M}_{t=1}L_{t}$, where $L_{t}$ is the loss for task $t$.

3.2.3 Unsupervised Loss

The third component, $L_{u}$, of our total loss, $L$, represents Tobler’s First Law of Geography: “Everything is related to everything else, but near things are more related than distant things.” Specifically, we assume that geographically close street-level panoramas should have similar star ratings, and we encourage the network to produce identical output distributions for adjacent panoramas. While this assumption is not always true, we find that it improves the accuracy of our final network.

The key feature of this loss is that it does not require the panoramas to be manually annotated. Therefore, we can greatly expand our training set size by including unsupervised examples. This is important, because due to the small size of the safety rating dataset and a large number of parameters in the network, we found it impossible to update VGG layer weights without overfitting when using only supervised losses.

To define this loss, we first build a set, $U=\{(a_{i},b_{i})\}$, of panorama pairs. The pairs are selected so that they are spatially close (within 50 feet) and along the same road. For each pair, we compute the Cramer distance between their rating predictions, $(\hat{y}_{a},\hat{y}_{b})$, as in:

$$L_{u} = \frac{1}{|U|} \sum_{(a,b) \in U} \left\| F(\hat{y}_{a}) - F(\hat{y}_{b}) \right\|_{2}^{2}.$$
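As an illustration of how the pair set $U$ might be assembled, the sketch below pairs consecutive panoramas on the same road when they are less than 50 feet apart. The record layout (an identifier, a road identifier, and an offset in feet along the road) is a hypothetical schema chosen for the example and is not taken from the paper.

```python
from collections import defaultdict

def build_pairs(panoramas, max_gap_feet=50.0):
    """Pair consecutive panoramas on the same road that are close together.

    panoramas: iterable of (pano_id, road_id, offset_feet) tuples, where
    offset_feet is the distance along the road (hypothetical schema).
    Returns a list of (pano_id_a, pano_id_b) pairs.
    """
    by_road = defaultdict(list)
    for pano_id, road_id, offset in panoramas:
        by_road[road_id].append((offset, pano_id))

    pairs = []
    for road_id in sorted(by_road):
        ordered = sorted(by_road[road_id])          # order along the road
        for (off_a, id_a), (off_b, id_b) in zip(ordered, ordered[1:]):
            if off_b - off_a < max_gap_feet:        # keep only nearby neighbors
                pairs.append((id_a, id_b))
    return pairs

panoramas = [
    ("p1", "road_A", 0.0),
    ("p2", "road_A", 40.0),
    ("p3", "road_A", 200.0),   # too far from p2, so it forms no pair
    ("p4", "road_B", 10.0),
    ("p5", "road_B", 45.0),
]
pairs = build_pairs(panoramas)
```

No labels are needed at any point, which is what lets the unsupervised set grow far beyond the annotated one.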

3.3 Implementation Details

Our network is optimized using Adam [11]. We initialize VGG using weights pre-trained for object classification [21]. We also experimented with weights pre-trained for scene categorization [32] and found that object classification was superior, and both were significantly better than random initialization. We allow the final convolutional block of VGG to optimize with a learning rate of 0.0001, while all task-specific layers use a learning rate of 0.001. We decay the learning rates three times during training, by a factor of 10 each time.

Through experimentation, we find that optimizing the total loss with $\lambda_{s^{1}}=1$, $\lambda_{s^{2}}=100$, $\lambda_{s}=0.1$, $\lambda_{m}=1$, and $\lambda_{u}=0.001$ offers the best results. We use ReLU activations throughout, except for attention mechanisms and final output layers, which apply a softmax function. For every trainable layer other than the attention mechanisms, we apply $L_{2}$ regularization to the weights with a scale of 0.0005. We train with a mini-batch size of 16, with each batch consisting of 16 supervised panoramas and labels along with 16 pairs of unsupervised panoramas.

4 Evaluation

We evaluate our methods both quantitatively and qualitatively. Below we describe the datasets used for these experiments, explore the performance of our network through an ablation study, and present an application to safety-aware vehicle routing.

4.1 Datasets

For supervised and multi-task training, we utilize a dataset annotated through the U.S. Road Assessment Program (usRAP), which contains star ratings and auxiliary task labels for 100-meter segments of roads in both Urban and Rural regions. To obtain the labels for each location, a trained coder visually inspects the imagery for a road segment and assigns multi-task labels using a custom user interface. During this process, the coder is free to adjust the viewing direction. The star rating for each location is then calculated from the auxiliary labels using the Star Rating Score equations. For more information on dataset annotation methods, please refer to [1, 7, 18].

The Rural area has 1,829 panoramas and the Urban area has 5,459 panoramas, for a total of 7,288 samples. Figure 4 shows scatter plots of panorama locations for each region, color coded by the safety score manually estimated from the imagery.

For unsupervised training data, we uniformly sample road segments from the state surrounding the Rural region and then query for panoramas that are less than 50 feet apart along the segment. The result is a dataset of approximately 36,000 pairs of panoramas.

4.2 Preprocessing

Panorama Processing

For each sample, we download an orbital panorama through Google Street View. We orient panoramas to the heading used during manual coding. This is important because safety ratings are sensitive to the direction of travel. During training only, we augment the data by jittering the direction of travel uniformly at random between -5 and 5 degrees. Finally, each image is cropped vertically and resized to $224\times 960$. The cropping operation removes the unneeded sky and ground portions of the panorama, and we preserve the aspect ratio of the cropped image when resizing.
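For an equirectangular panorama, a change of heading corresponds to a circular shift of the image columns, so the jitter and crop steps can be sketched in pure Python on a nested-list image. The sign convention of the shift, the crop bounds, and the function names are assumptions made for illustration; the final resize is omitted because it would need an image library.

```python
import random

def heading_shift_pixels(width, degrees):
    """Columns to roll a 360-degree equirectangular panorama for a yaw change."""
    return round(degrees * width / 360.0)

def roll_columns(image, shift):
    """Circularly shift every row of the image by `shift` columns."""
    width = len(image[0])
    shift %= width
    return [row[-shift:] + row[:-shift] for row in image]

def augment(image, rng, max_jitter_deg=5.0, top=0, bottom=None):
    """Jitter the heading uniformly in [-max, max] degrees, then crop rows.

    top/bottom are hypothetical crop bounds; the paper does not state them.
    """
    jitter = rng.uniform(-max_jitter_deg, max_jitter_deg)
    shifted = roll_columns(image, heading_shift_pixels(len(image[0]), jitter))
    return shifted[top:bottom]

# Tiny 2 x 4 "panorama" whose pixel values show where each column ends up.
image = [[0, 1, 2, 3], [4, 5, 6, 7]]
rolled = roll_columns(image, 1)

# Train-time usage with a seeded generator for reproducibility.
sample = augment(image, random.Random(0), top=0, bottom=1)
```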

Train/Test Split

To create train/test splits for network training and evaluation, we apply stratified sampling to each region’s data, assigning 90% to train and 10% to test. A validation set (2% of train) is used for model selection. Corresponding splits from the two regions are combined to form the final sets. We ensure that locations in the test set are at least 300 meters away from locations in the train set.

Class Weighting

The distribution of labels in our training split is very unbalanced, which led to poor fitting to panoramas with labels in the minority groups (1-star roads specifically). To alleviate this issue, we apply a class weighting strategy to the star rating loss function that adjusts the loss of each sample according to the frequency of its ground truth label.

We first compute the class weight vector $w$ over the star rating classes through the equation below:

$$w_{l} = \frac{N}{K \cdot \mathrm{count}(l)}$$

where $w_{l}$ is the weight for class $l$. With the weight vector $w$, we modify the classification loss $L_{s^{1}}$ and the regression loss $L_{s^{2}}$ (equations 1 and 2) as follows:

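$$L_{s^{1}} = -\frac{1}{N} \sum_{i=1}^{N} w_{l_{i}}\, y_{i}(l_{i}) \log \hat{y}_{i}(l_{i})$$

$$L_{s^{2}} = \frac{1}{N} \sum_{i=1}^{N} w_{l_{i}} \left\| F(\hat{y}_{i}) - F(y_{i}) \right\|_{2}^{2}$$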

where $w_{l_{i}}$ is the weight for class $l_{i}$.
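A short sketch of the class weighting, reading $N$ as the number of training samples, $K$ as the number of star rating classes, and $\mathrm{count}(l)$ as the number of training samples with label $l$. The label counts below are toy values invented for the example and are not the dataset's statistics.

```python
from collections import Counter

def class_weights(labels, num_classes=5):
    """w_l = N / (K * count(l)) for every class l that has at least one sample."""
    n = len(labels)
    counts = Counter(labels)
    return {
        l: n / (num_classes * counts[l])
        for l in range(1, num_classes + 1)
        if counts[l] > 0
    }

# Toy label distribution (N = 20) with a rare 1-star class.
labels = [1] * 1 + [2] * 3 + [3] * 8 + [4] * 6 + [5] * 2
weights = class_weights(labels)

# Rare classes get weights above 1, common classes below 1.
rare_weight = weights[1]      # 20 / (5 * 1)
common_weight = weights[3]    # 20 / (5 * 8)
```

When every class is present, the weights sum to $N$ over the training set, so the weighted losses stay on the same overall scale as the unweighted ones.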

4.3 Ablation Study

We compare our complete architecture with five variants. The simplest method, Baseline, omits the adaptive attention layer and instead equally weights all parts of the input feature map (i.e., global average pooling). The remaining variants include different combinations of adaptive attention, multi-task loss, and unsupervised loss. For each method, we performed a greedy hyperparameter search using our validation set. We initially selected the optimal weighting of $\lambda_{s^{2}}$ relative to $\lambda_{s^{1}}$ by choosing $\lambda_{s^{2}}$ from $\{0.01, 0.1, 1, 10, 100\}$. We used the same strategy when adding a new feature, while keeping the existing hyperparameters fixed. We train all networks for 10,000 iterations.

Table 2 displays the macro-averaged accuracy for all methods on the test set, where for each method we compute the per-class accuracy and average the class accuracies. Our method outperforms all other methods, each of which includes a subset of the complete approach. We observe that the Baseline, M1, M2, and M3 methods all achieve similar accuracy. The next method, M4, which combines the per-task attention and multi-task learning, performs significantly better. It seems that the multi-task loss and attention in isolation are not particularly helpful, but when they are combined they lead to improved performance. We also observe that the unsupervised loss is only significantly helpful when combined with per-task attention and the multi-task loss.
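The macro-averaged accuracy used here can be sketched as follows. The function name and the toy labels are hypothetical; the average is taken over the classes that appear in the ground truth. The example shows why the metric suits an imbalanced test set: a model that always predicts the majority class scores well on plain accuracy but poorly on the macro average.

```python
def macro_accuracy(y_true, y_pred):
    """Mean of per-class accuracies over the classes present in y_true."""
    correct, total = {}, {}
    for truth, pred in zip(y_true, y_pred):
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (1 if truth == pred else 0)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Eight 3-star roads and two 1-star roads; the model always answers 3 stars.
y_true = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1]
y_pred = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_accuracy(y_true, y_pred)
```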

Figure 5 shows the test-set confusion matrix for our best method. The results are quite satisfactory except for the 1-star rating category, but that is as expected due to the imbalanced nature of our training dataset, with only 5.6% of all samples being 1-star roads.

4.4 Visualizing Attention

As described in Section 3.2.2, each task has an independent attention mask, $\boldsymbol{a}$, whose shape ($1\times 210$) corresponds to the flattened spatial dimensions of the feature map, $S_{2}$, output by the previous convolutional layer. Therefore, a reshaped version of $\boldsymbol{a}$ corresponds to the regions in $S_{2}$ that are important when making predictions for a given task. Figure 6 visualizes the attention mechanism for our main task and several of our auxiliary tasks, where lighter (yellow) regions have higher weight than darker (red) regions. For example, in Figure 6 (e), attention is focused on the left side of the panorama, which makes sense given the task is related to identifying dangerous driver-side objects, such as a drainage ditch or safety barrier.

4.5 Multi-task Evaluation

We evaluate the multi-task prediction performance of our architecture. Table 3 shows the Top-1 macro-averaged accuracy for each of our auxiliary tasks, along with random performance if we sample predictions based on the prior distribution of labels for that task. The rightmost column shows the absolute increase in accuracy (Top-1 minus random, in percentage points) for each task, which in some cases is almost 50 points.

4.6 Safety-Aware Vehicle Routing

While our focus is on road safety assessment for infrastructure planning, our network could be used for other tasks. Here we show how it could be used to suggest less-risky driving routes. We use a GPS routing engine that identifies the optimal path, usually the shortest and simplest, a vehicle should traverse to reach a target destination. Some work has been done to explore semantic routing with respect to scenery [20], carpooling [9], and personalization [23]. We propose routing that employs road safety scores for navigating to a destination, in order to balance speed and safety.

We selected a subset of the panorama test split and used it to influence the edge cost calculation of the Mapzen Valhalla routing engine. Each safety score in the subset was used as an edge factor for the edge at the safety score’s GPS coordinate. When the routing engine computed a route’s traversal cost over the road graph, it checked whether the subset contained a cost factor for each edge. If a cost factor was present, the edge cost of the traversal became $c_{edge}=c_{o}\cdot factor$, where $c_{o}$ denotes the edge’s original cost. The routing engine uses the augmented edge costs to determine the optimal route, namely the route with the lowest cumulative cost.
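The effect of the edge factors can be illustrated with a small Dijkstra search in pure Python. This is a standalone sketch and not Valhalla's API: the graph layout, the `factors` dictionary, and the way a safety score maps to a factor are all hypothetical, and edges without a factor keep their original cost.

```python
import heapq

def safest_route(graph, factors, source, target):
    """Dijkstra over edge costs c_edge = c_o * factor (factor defaults to 1).

    graph: {node: [(neighbor, base_cost), ...]}
    factors: {(node, neighbor): factor} for edges that have a safety score.
    Returns (path, cost), or (None, inf) if the target is unreachable.
    """
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, base_cost in graph.get(node, []):
            new_cost = cost + base_cost * factors.get((node, neighbor), 1.0)
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], best[target]

# Two ways from A to D: a short one through B and a longer one through C.
graph = {
    "A": [("B", 1.0), ("C", 1.5)],
    "B": [("D", 1.0)],
    "C": [("D", 1.5)],
}

default_route = safest_route(graph, {}, "A", "D")

# Hypothetical factor of 3 on A->B, standing in for a low star rating.
safe_route = safest_route(graph, {("A", "B"): 3.0}, "A", "D")
```

Without factors the search returns the shorter route through B; once the risky edge is penalized, the longer route through C becomes the cheapest, mirroring the detour discussed for Figure 7.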

The two routes depicted in Figure 7 demonstrate the impact of safety-aware routing in a major US urban area. Figure 7-right shows a less risky, but longer, route chosen by the enhanced routing engine, while Figure 7-left shows the default route. The enhanced routing engine chooses this route to circumvent a low (1 or 2) safety scoring road and instead travel on a high (4 or 5) scoring road. Figure 8-top shows a panorama from the higher risk road. It has numerous issues, including poor pavement condition and small lane widths. Figure 8-bottom shows a panorama from the less risky road. This road has wider lanes, better pavement, and a physical median. While this route may take longer, it clearly traverses a less risky path.

While this is a minor modification to an existing routing engine, we think the ability to optimize for safety over speed could lead to a significant reduction in injuries and deaths for vehicle users.

5 Conclusion

In this paper, we introduced an automated approach to estimate roadway safety that is significantly faster than previous methods, which require substantial manual effort. We demonstrated how a combination of a spatial attention mechanism, transfer learning, multi-task learning, and unsupervised learning results in the best performance on a large dataset of real-world images. This approach has the potential to dramatically affect the deployment of limited transportation network improvement funds toward locations that will have maximum impact on safety. The outlined approach addresses the main concern of many agencies in deploying systemic analysis (such as usRAP): the cost to collect and process data. As the availability of street-level panoramas is growing rapidly, employing automation techniques could allow many more agencies to take advantage of systemic road safety techniques. For many agencies, there are few options available to prioritize roadway safety investments. Even when and where sufficient data are available, smaller agencies typically lack the expertise to conduct robust safety analysis as described in the HSM [10] for high crash location assessment or in usRAP for systemic study. Our approach could reduce the cost of systemic analysis to these agencies, or help larger agencies assess more of their roads, more frequently. In future work, we plan to explore at least three offshoots of this work: 1) applying the proposed method to additional highway safety tasks, 2) integrating overhead imagery, and 3) making assessments using multiple panoramas.

Acknowledgments

We gratefully acknowledge the financial support of NSF CAREER grant IIS-1553116 and computing resources provided by the University of Kentucky Center for Computational Sciences, including a hardware donation from IBM.

Bibliography


[1] United States Road Assessment Program. http://www.usrap.org. Accessed: 2018-1-13.
[2] S. M. Arietta, A. A. Efros, R. Ramamoorthi, and M. Agrawala. City forensics: Using visual elements to predict non-visual city attributes. IEEE Transactions on Visualization and Computer Graphics, 20(12):2624–2633, 2014.
[3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[4] A. Dubey, N. Naik, D. Parikh, R. Raskar, and C. A. Hidalgo. Deep learning the city: Quantifying urban perception at a global scale. In European Conference on Computer Vision, 2016.
[5] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[6] B. Gibson. Analysis of autonomous vehicle policies. Technical report, Kentucky Transportation Center, 2017.
[7] D. Harwood, K. Bauer, D. Gilmore, R. Souleyrette, and Z. Hans. Validation of US Road Assessment Program star rating protocol: Application to safety management of US roads. Transportation Research Record: Journal of the Transportation Research Board, 2147:33–41, 2010.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision, 2015.