An unsupervised approach to Geographical Knowledge Discovery using street level and street network images

Stephen Law; Mateo Neira

arXiv:1906.11907·cs.CV·October 14, 2019

TL;DR

This paper introduces an unsupervised method called ConvPCA that extracts meaningful latent features from street-level and network images, aiding geographic knowledge discovery and urban characteristic prediction.

Contribution

It presents a novel unsupervised approach, ConvPCA, for extracting interpretable latent variables from urban images for geographic analysis.

Findings

01

ConvPCA achieves comparable accuracy to traditional dimension reduction methods.

02

Latent components enable meaningful geographical and visual explanations.

03

The approach predicts urban features like street quality effectively.

Abstract

Recent research has shown the increasing use of machine learning methods in geography and urban analytics, primarily to extract features and patterns from spatial and temporal data using a supervised approach. Research integrating geographical processes in machine learning models, and the use of unsupervised approaches on geographical data for knowledge discovery, has been sparse. This research contributes to the latter, where we show how latent variables learned from unsupervised learning methods on urban images can be used for geographic knowledge discovery. In particular, we propose a simple approach called Convolutional-PCA (ConvPCA), which is applied to both street level and street network images to find a set of uncorrelated and ordered visual latent components. The approach allows for meaningful explanations using a combination of geographical and generative visualisations to…

Tables

Table 1. Street enclosure results

accuracy	$PCA_{lin}$	$AE_{lin}$	$AE_{non}$
4 components	41.50%	41.80%	50.96%
8 components	53.64%	54.02%	55.78%
16 components	56.37%	56.24%	56.77%
32 components	58.36%	58.00%	58.62%
64 components	59.17%	58.45%	59.98%

Table 2. Street frontage results

accuracy	$PCA_{lin}$	$AE_{lin}$	$AE_{non}$
4 components	61.46%	59.91%	60.96%
8 components	64.10%	62.26%	62.84%
16 components	68.24%	67.17%	67.44%
32 components	69.13%	68.51%	68.71%
64 components	71.41%	71.93%	70.50%

Table 3. Street intersection density results

accuracy	$PCA_{lin}$	$AE_{lin}$	$AE_{non}$
4 components	76.59%	62.90%	73.56%
8 components	77.28%	76.43%	72.01%
16 components	75.54%	75.94%	74.00%
32 components	71.15%	71.65%	76.00%
64 components	69.45%	73.70%	71.83%

Table 4. Street closeness centrality results

accuracy	$PCA_{lin}$	$AE_{lin}$	$AE_{non}$
4 components	54.63%	59.89%	60.23%
8 components	55.41%	59.51%	58.22%
16 components	54.17%	58.19%	59.02%
32 components	51.85%	57.74%	58.92%
64 components	34.33%	52.00%	53.45%

Table 5. Global spatial autocorrelation of outputs of min-max perturbations of the first 8 $pca$ components of street networks

$PCA$	$I$ of min	$I$ of max
$1^{s t}$ component	0.87	0.88
$2^{n d}$ component	0.96	0.87
$3^{r d}$ component	0.93	0.89
$4^{t h}$ component	0.98	0.94
$5^{t h}$ component	0.99	0.97
$6^{t h}$ component	0.98	0.96
$7^{t h}$ component	0.99	0.98
$8^{t h}$ component	0.99	0.98

Table 6. Global spatial autocorrelation of $pca$ components

$PCA$	$L$
$1^{s t}$ component	0.68
$3^{r d}$ component	0.75

Equations

\begin{array}{l}f_{w}(x)=\sigma(x\ast W)\equiv z\\ g_{u}(z)=\sigma(z\ast U)\end{array}

L_{r} = \frac{1}{n} \sum (x_{i} - G_{u} (F_{w} (x_{i})))^{2}

C (u) = \frac{n - 1}{\sum _{v \in V}^{n - 1} d ( v , u )}

Full text

An unsupervised approach to geographical knowledge discovery using street level and street network images

Stephen Law∗

The Alan Turing Institute & University College London, London, UK

and

Mateo Neira∗

The Alan Turing Institute & University College London, London, UK

(2019)

Abstract.

Recent research has shown the increasing use of machine learning methods in geography and urban analytics, primarily to extract features and patterns from spatial and temporal data using a supervised approach. Research integrating geographical processes in machine learning models, and the use of unsupervised approaches on geographical data for knowledge discovery, has been sparse. This research contributes to the latter, where we show how latent variables learned from unsupervised learning methods on urban images can be used for geographic knowledge discovery. In particular, we propose a simple approach called Convolutional-PCA ( $ConvPCA$ ), which is applied to both street level and street network images to find a set of uncorrelated and ordered visual latent components. The approach allows for meaningful explanations using a combination of geographical and generative visualisations to explore the latent space, and shows how the learned representation can be used to predict urban characteristics such as street quality and street network attributes. The research also finds that the visual components from the $ConvPCA$ model achieve similar accuracy when compared to less interpretable dimension reduction techniques.

urban analytics, unsupervised learning, convolutional neural networks, knowledge discovery, computer vision, machine learning

∗Both authors contributed equally to this research.

Published in: 3rd ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery (GeoAI'19), November 5, 2019, Chicago, IL, USA. Copyright: ACM. DOI: 10.1145/3356471.3365238. ISBN: 978-1-4503-6957-2/19/11.

CCS concepts: Computing methodologies → Unsupervised learning; Computing methodologies → Computer vision; Applied computing → Architecture (buildings).

1. Introduction

According to Miller and Han (2001), geographic knowledge discovery (GKD) is the process of using computational methods and visualisation to explore spatial databases to discover useful geographic knowledge. Despite the ubiquity of geographically-labelled image data and the subsequent use of machine learning methods to retrieve geographical information, the majority of research has focused on supervised learning approaches. Examples include the use of convolutional neural networks ( $CNN$ ) to make inferences on perceived safety (Naik et al., 2014), house price (Law et al., 2018) and scenicness (Seresinhe et al., 2017). These studies required effort both in collecting the data and in learning a specific objective. As a result, there is an opportunity to use urban image information in an unsupervised and scalable manner. Our research questions are therefore: what compact latent representation can be learnt from urban images without supervision, how can this information be described, and what is this representation useful for?

This study addresses these research questions and proposes an unsupervised learning model called Convolutional-PCA ( $ConvPCA$ ) that summarises urban imagery into a set of lower dimensional uncorrelated latent components. We apply this method to two case studies: Google StreetView images (Google, 2018) and OpenStreetMap ( $OSM$ ) street network images (OpenStreetMap contributors, 2019). In the experiments, we first map and visualise the extremes of the responses geographically and generate new synthetic images by perturbing the values of each component whilst holding all the other component values constant. We then study the latent components by using them to predict different geographical datasets, such as street enclosure and street frontage type for the StreetView image data, and network density and network centrality for the street network image data. The research finds that the visual components from the $ConvPCA$ model have interpretable meanings and, using a compact representation, can predict geographically labelled data. The research also finds that the visual components from the $ConvPCA$ model achieve a similar accuracy to other dimension reduction techniques such as an autoencoder while retaining their interpretability. From a machine learning perspective, we gain new knowledge about these latent components, which contributes to the recent efforts in linking the two disciplines (Klemmer et al., 2019; Reichstein et al., 2018).

2. Related Works

2.1. StreetViews

Street-level images have been used extensively in intelligent transportation systems research, specifically in the deployment of autonomous vehicles, where deep convolutional neural networks ( $CNN$ ) have been applied for urban scene understanding (Ros et al., 2016). More recently, we have also seen the use of generative models such as Generative Adversarial Networks ( $GAN$ ) to synthetically create street scenes that can be used to train self-driving vehicles (Wang et al., 2018). Despite its popularity in transportation research, there has been limited effort on using street-level imagery to retrieve geographical information and to study urban planning problems. One such example is StreetScore, where Naik et al. (2014) collected subjective human perception data from street images through a crowd-sourced survey (Place Pulse 2.0), which were then used to predict the perceived safety of a place (Dubey et al., 2016). Another example is the work of Gebru et al. (2017), who extracted features such as car types from Google StreetView images to predict income, race, education, and voting patterns for cities in the United States. Urban images have also been used to predict scenicness ratings, which were found to affect urban wellbeing (Seresinhe et al., 2017), as well as the type of urban frontage (Law et al., 2018), which is an important urban design attribute. These fairly recent efforts relied on extracting visual features from street-level images, which are then related to different socio-economic factors. In contrast to these works, Law et al. (2019) extracted a visual response from urban images by directly estimating house prices. A distinguishing difference here is that the method extracted a visual scalar response that corresponds directly to house price, which can be visualised and interpreted in a traditional econometric model.
In summary, this recent research focused on learning a set of visual features or responses from urban imagery using a supervised learning approach. Our research extends this work by proposing a two-stage method for learning a set of generic and compact visual latent components using an unsupervised learning approach. We then, through a set of analyses and experiments, interrogate, describe and explore these components for geographic knowledge discovery.

2.2. Street Networks

In the case of street networks, there has been a long-standing effort to analyse and to understand them from a quantitative perspective and to generate models that are able to reproduce their empirical features. Previous works have largely been based on complexity theory and network science (Boeing, 2018a; Louf and Barthelemy, 2014; Strano et al., 2012). This includes analyzing the spatial configuration of urban street networks (Hillier, 2007) and analyzing urban systems from an information theoretic perspective (Batty, 2005).

More recently, there has been a growing interest in applying machine learning methods to extract useful information from the vast amount of data now openly available from sources such as $OSM$ . Examples of such works have used neural networks to classify street network patterns of different cities, using two different methods. The first used a Convolutional Autoencoder ( $CAE$ ) to create dense urban vectors that are used to cluster similar urban morphologies using a self-organizing map (Moosavi, 2017). The second approach used a Variational Autoencoder ( $VAE$ ) to measure similarity across different networks (Kempinska and Roberto, 2019).

Generative models have also been used to generate synthetic street networks. A Variational Autoencoder trained on street network images has been used to generate networks by sampling from the latent space $z$ (Kempinska and Roberto, 2019); however, the resolution of the outputs is low and they fail to capture the fine-grained detail of local streets. A Generative Adversarial Network, $StreetGAN$ (Hartmann et al., 2017), has also been proposed to generate a multitude of arbitrarily sized street networks that faithfully reproduce the style of the original dataset.

Current limitations in the use of VAEs, CAEs, and GANs on street networks lie in the interpretability of the latent space and its relationship to the geometrical and topological properties used in established network measures. Our research contributes to this body of work by developing a methodology to interpret the lower-dimensional embedding learnt by a convolutional autoencoder. This allows for greater interpretation of the unsupervised model and provides some initial results on the relationship between the embedding and the established network measures.

3. Methods and Materials

3.1. Convolutional-PCA

We propose here the Convolutional-PCA ( $ConvPCA$ ), which combines a type of Convolutional Neural Network called the Convolutional Autoencoder ( $CAE$ ) with a linear PCA ( $PCA_{lin}$ ) to retrieve a set of latent visual components that summarise a StreetView image or a street network image. We first describe the $CAE$ , followed by the $PCA_{lin}$ . The deep Convolutional Autoencoder ( $CAE$ ) is an unsupervised method that uses a convolutional neural network ( $CNN$ ) to learn a compact representation or a set of visual features (Masci et al., 2011; Bengio et al., 2007; Guo et al., 2017). A deep CAE consists of two sets of layers, an encoder $f_{w}(\cdot)$ and a decoder $g_{u}(\cdot)$ :

$$\begin{array}{l}f_{w}(x)=\sigma(x\ast W)\equiv z\\ g_{u}(z)=\sigma(z\ast U)\end{array}$$

where $x$ is the input vector, $z$ is the latent feature vector, $\ast$ is the convolution operator that extracts image features and $\sigma$ is a $ReLU$ activation function that models nonlinearity in the neural network. These convolutional layers can be stacked sequentially, where the encoding layers reduce the dimension to a latent variable $z$ while the decoding layers increase the dimension back to image space. The sequential architecture can be seen in figure 1. Following (Masci et al., 2011), the parameters of the encoder $z=F_{w}(x)$ and the decoder $x^{\prime}=G_{u}(z)$ are updated by minimising the reconstruction loss between each input $x_{i}$ and its reconstruction $x_{i}^{\prime}=G_{u}(F_{w}(x_{i}))$ :

$$L_{r}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-G_{u}(F_{w}(x_{i}))\right)^{2}$$

In our research, we further compress the latent visual features by applying a linear principal component analysis ( $PCA_{lin}$ ), which summarises the visual features $z$ into a set of linearly uncorrelated variables $v$ . To compute $v$ , we first standardise $z$ and compute the eigenvectors and eigenvalues of the feature covariance matrix $P$ . We then use the eigenvectors to calculate the full principal component decomposition of $z$ , given by $V=XW$ , where $X$ denotes the standardised $z$ and $W$ is the eigenvector matrix. $V$ can be re-projected back onto the original latent space produced by the encoder before being passed to the decoder to reconstruct the images. This process allows us to:

• Retrieve a set of uncorrelated and ordered visual latent components that can be visualised and mapped geographically.

• Make changes to individual components and decode them to generate a synthetic image.

• Relate the learnt visual latent components to geographically labelled data.
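The PCA step and the perturbation of a single component described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the random matrix `z` stands in for real encoder outputs, all function names are hypothetical, and the decoder call that would turn the re-projected latent vector into an image is omitted.

```python
import numpy as np

def fit_pca(z):
    """Standardise latent features z (n_samples, n_features) and
    eigendecompose their covariance matrix."""
    mean = z.mean(axis=0)
    std = z.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant features
    x = (z - mean) / std                     # standardised z
    cov = np.cov(x, rowvar=False)            # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # reorder by decreasing variance
    return mean, std, eigvals[order], eigvecs[:, order]

def to_components(z, mean, std, w):
    """Project latent features onto the principal components (V = XW)."""
    return ((z - mean) / std) @ w

def to_latent(v, mean, std, w):
    """Re-project components back to the encoder's latent space.
    W is orthogonal, so its inverse is its transpose."""
    return (v @ w.T) * std + mean

def perturb(v_row, index, value):
    """Copy a component vector and overwrite a single component."""
    out = v_row.copy()
    out[index] = value
    return out

# Hypothetical stand-in for encoder outputs: 200 samples, 6 latent features.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))

mean, std, eigvals, w = fit_pca(z)
v = to_components(z, mean, std, w)

# Hold every component at its mean and push component 0 to its extremes.
v_hat = v.mean(axis=0)
z_low = to_latent(perturb(v_hat, 0, v[:, 0].min()), mean, std, w)
z_high = to_latent(perturb(v_hat, 0, v[:, 0].max()), mean, std, w)
# z_low and z_high would then be passed to the trained decoder.
```

Because the eigenvector matrix is orthogonal, projecting to component space and back is lossless when all components are kept, which is what makes editing one component and decoding the result well defined.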

To discover new geographical knowledge and to test the usefulness of the latent representation, we visualise these components by generating new images when perturbing values in the $PCA_{lin}$ space and by mapping them. We then use these components for downstream tasks such as prediction and classification. The process can be seen in figure 1. Further research is required to validate the meaning of these visual latent components quantitatively and to compare them with latent components extracted from other methods. These limitations are further elaborated in the discussion section.

$PCA_{lin}$ is selected as it is a well-principled dimension reduction technique that learns a compact and meaningful representation with uncorrelated and ordered components. Approaches such as autoencoders can find a similar representation but without the same interpretability (Ladjal et al., 2019). In the prediction experiment, we study and compare the extent to which $PCA_{lin}$ , a linear autoencoder and a non-linear autoencoder are able to learn a compact representation for different downstream tasks. A benefit of finding uncorrelated ordered components is that these factors can be inserted into a generalised linear modelling framework, whose coefficients can be interpreted. Such analysis is not explored in this paper, but the coefficients are meaningful, for example, in econometric studies (Law et al., 2019).

3.2. Materials

We collected two datasets. The first dataset consists of street images taken from the Google StreetView API (Google, 2018; ©2017 Google Inc.). Similar to (Law et al., 2018), we collected a front-facing image for each street in the Greater London Area. To collect the dataset, we constructed a line-graph from the street network of London (OS Meridian line2 dataset (Survey, 2017)). We then took the geographic median and the azimuth of the street edge to give both the location and the bearing when collecting each image. We collected a total of $110,493$ street images in London. For more details on the data collection method, please see (Law et al., 2018). Figure 3 illustrates typical images from the dataset.

The second dataset is the street network dataset taken from OpenStreetMap (OpenStreetMap contributors, 2019), for which we queried all cities and towns, a total of $107,973$ . For each city and town, we downloaded the street network within a 1.5 km x 1.5 km box at the centroid of each place using osmnx (Boeing, 2017), as shown in Figure 4. For each 1.5 km x 1.5 km grid, we retrieved a graph $G=(V,E)$ where each vertex $v$ corresponds to a street intersection and each edge $e$ corresponds to a street segment. We rasterised each $G$ into a 256 x 256 pixel image, as shown in Figure 5. We also calculated basic network statistics (Boeing, 2018b), such as network centrality and network density, that are later used to test the learnt features of the images.
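The paper does not describe how each graph is rasterised, so the following is only one plausible sketch. It assumes edges are straight segments given in metres relative to the south-west corner of the 1.5 km box, and draws them into a binary 256 x 256 grid by sampling points along each segment. The function name and the input layout are hypothetical.

```python
def rasterise_edges(edges, box_m=1500.0, size=256):
    """Draw street segments into a size x size binary image (list of rows).

    edges: iterable of ((x0, y0), (x1, y1)) in metres from the box origin.
    """
    image = [[0] * size for _ in range(size)]

    def to_pixel(x, y):
        # Scale metres to pixel indices and clamp to the image bounds.
        col = min(size - 1, max(0, int(x / box_m * size)))
        row = min(size - 1, max(0, int(y / box_m * size)))
        return size - 1 - row, col  # flip rows so that north is up

    for (x0, y0), (x1, y1) in edges:
        r0, c0 = to_pixel(x0, y0)
        r1, c1 = to_pixel(x1, y1)
        # One sample per pixel along the longer axis of the segment.
        steps = max(abs(r1 - r0), abs(c1 - c0))
        for t in range(steps + 1):
            frac = t / steps if steps else 0.0
            row = round(r0 + (r1 - r0) * frac)
            col = round(c0 + (c1 - c0) * frac)
            image[row][col] = 1
    return image

# Two illustrative streets: one along the southern edge, one running north.
edges = [((0.0, 0.0), (1500.0, 0.0)), ((750.0, 0.0), (750.0, 1500.0))]
image = rasterise_edges(edges)
```

A single-channel image of this kind matches the 256 x 256 x 1 input size reported later for the street network autoencoder.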

4. Experimental Results

In order to discover new information and interpretations from these visual latent components, we visualise the components and use them for predictions on both the street level and street network datasets.

4.1. StreetView images

4.1.1. Visualisation experiments

The $ConvPCA$ first learns a mapping from three-channel street level images (224 x 224 x 3) down to a lower dimensional embedding (4,096 dimensions) using a convolutional autoencoder ( $CAE$ ). The lower dimensional embedding is then further summarised into a set of uncorrelated components using $PCA_{lin}$ . For the StreetView images, we adopted a $VGG$ -like architecture (Simonyan and Zisserman, 2014) where we keep the kernel size and filter numbers constant across both the encoder and decoder.

To show the results, we first plot the images with the highest and lowest principal component values for interpretation. In this case, component $pca$ 7 has blank facades at one extreme and natural scenes at the other. $pca$ 10 and $pca$ 30 show a tunnel space at one extreme and a mixture of urban scenes at the other, while $pca$ 14 has buildings at one extreme and blank facades at the other. Lower-ranked components that capture less variance seem to show fewer patterns and are therefore not visualised.

To interrogate the results of the primary components, we focus on visualising $pca$ 1 and $pca$ 3 geographically in figure 6. The images plotted above the map show the two extremes of the visual latent components. We can descriptively interpret these two visual components $v_{1}$ and $v_{3}$ as proxy measures for different types of street urbanity. We also show through global spatial autocorrelation analysis that these components exhibit strong spatial dependence. Please see the $appendix$ for more details on the spatial analysis.
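The text does not name the global spatial autocorrelation statistic. The symbol $I$ in the header of Table 5 suggests Moran's $I$, so the sketch below assumes that statistic. The function name, the binary neighbour weights and the toy values are all illustrative assumptions.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weight matrix.

    weights[i][j] > 0 marks locations i and j as neighbours.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)

# Four locations along a line, each a neighbour of the adjacent ones.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain)          # similar neighbours
alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)   # dissimilar neighbours
```

Values near 1 indicate that neighbouring locations carry similar component values, which is the sense in which the components are said to exhibit strong spatial dependence.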

We then took one of the StreetView images and perturbed each of the two principal components while holding all the other component values constant before passing it to the decoder to generate a synthetic image. More formally, for each $pca$ we create a mean vector $\hat{v}$ , where we keep all values in $\hat{v}$ constant and vary only the individual $pca$ before passing it to the decoder to generate a synthetic image. Figure 9 shows that when we perturbed $pca$ 1 of a typical StreetView image while holding all other principal components constant, building details tended to increase, and when we perturbed the same image in the other direction along the axis, building details tended to reduce. In contrast, when we perturbed $pca$ 3 of the same StreetView image whilst holding all other principal components constant, trees started appearing, and when we perturbed the same image in the other direction, buildings became more prominent and the trees disappeared. The result also shows that, for $pca$ 1, the streets widen in one direction while the car disappears in the other. This result suggests that each component is related to a quality measure of street urbanity and is possibly capturing multiple correlated visual features of a StreetView image. As a result, in terms of controllability, the approach does not seem able to disentangle highly correlated features. These descriptive results show that geographical and generative visualisations are useful approaches for discovering meaning in these visual latent components. However, more research is needed to validate the meaning of these visual components quantitatively.

4.1.2. Prediction experiment

In order to demonstrate the usefulness of these visual latent components for different downstream tasks, we constructed two separate models to map the latent visual components $V$ to street enclosure (regression task) and street frontage quality (classification task). We compare $PCA_{lin}$ to two other dimension reduction techniques for retrieving the latent visual components: a linear autoencoder $AE_{lin}$ and a nonlinear autoencoder $AE_{non}$ . The linear autoencoder uses a linear activation function with one bottleneck layer that outputs $V$ . The nonlinear autoencoder, on the other hand, uses the $ReLU$ activation function with three hidden layers, where the first and third hidden layers are the encoding and decoding layers with $512$ neurons each and the second layer is the bottleneck layer that outputs $V$ .

Street enclosure is defined here as the average height of the buildings on a street divided by the average width between the buildings of the same street, as illustrated in fig 10. The street enclosure is calculated by segmenting the streets from Ordnance Survey data (Survey, 2017) every $40m$ . For each street segment $S$ , we calculate the geographic median $S_{c}$ and the azimuth $S_{\alpha}$ , and create a new line that is perpendicular to $S$ at the point $S_{c}$ . The perpendicular line $S_{\bot}$ is used to create the street profile by intersecting it with the closest building on either side of the street and querying the associated height attribute. This profile is used to calculate the street enclosure as the building height to street width ratio $enc=\overline{h}/w$ . Please see (Neira and Narvaez, 2019) for additional details.
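The geometry of this calculation can be sketched as below. This is a simplified illustration rather than the authors' pipeline. It assumes planar coordinates in metres, treats the geographic median of a straight segment as its midpoint, and takes the building heights and street width as already known instead of intersecting $S_{\bot}$ with building footprints. Both function names are hypothetical.

```python
import math

def perpendicular_line(start, end, half_length_m):
    """Return the end points of a line perpendicular to the segment
    start -> end, centred on the segment midpoint."""
    (x0, y0), (x1, y1) = start, end
    mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0    # midpoint S_c
    length = math.hypot(x1 - x0, y1 - y0)
    # Unit normal to the segment direction.
    nx, ny = -(y1 - y0) / length, (x1 - x0) / length
    return ((mx - nx * half_length_m, my - ny * half_length_m),
            (mx + nx * half_length_m, my + ny * half_length_m))

def street_enclosure(building_heights_m, street_width_m):
    """Enclosure ratio enc = mean building height / street width."""
    if street_width_m <= 0:
        raise ValueError("street width must be positive")
    mean_height = sum(building_heights_m) / len(building_heights_m)
    return mean_height / street_width_m

# A 40 m east-west segment with a 50 m long profile line across it.
profile = perpendicular_line((0.0, 0.0), (40.0, 0.0), 25.0)
# Buildings of 12 m and 8 m facing each other across a 20 m wide street.
enc = street_enclosure([12.0, 8.0], 20.0)
```

In this example the mean height is 10 m and the width is 20 m, giving an enclosure ratio of 0.5.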

Street frontage type is defined here using four categories: active frontage on both sides of the street, active frontage on one side of the street, non-active frontage and non-urban frontage. This dataset was manually compiled in a previous study. Please see (Law et al., 2018) for additional details.

We split the dataset randomly into a training set (70%), a validation set (15%) and a test set (15%). We then train a multi-layer perceptron $F(\cdot)$ , parameterized by a set of weights $W_{v}$ , to predict street enclosure and street frontage types from the visual latent components $V$ as inputs.
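A random 70/15/15 split of this kind can be sketched with the standard library. The function name, the fixed seed and the use of index lists are illustrative choices, not details taken from the paper.

```python
import random

def split_indices(n_samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle sample indices and split them into train/validation/test.

    The test set receives whatever remains after the first two splits.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100)
```

A purely random split like this ignores spatial proximity between samples, which is consistent with the paper's description of the test set as spatially random.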

The multi-layer perceptron ( $mlp$ ) is defined here as a fully connected neural network with three hidden layers. The first fully connected layer has 64 hidden nodes, the second has 32 hidden nodes and the third has 16 hidden nodes. A dropout layer ( $0.2$ ) and $l1$ regularisation were added at the final activation layer for better generalisation. To test the importance of the visual components with respect to the model accuracy, we constructed five different models based on the number of components [4,8,16,32,64] using the three dimension reduction techniques. This results in 15 models in total.

We train the street enclosure model to minimise the mean squared error ( $MSE$ ) on the training set, using the ADAM optimiser (Kingma and Ba, 2014) with an initial learning rate of 0.001. We then report the $MSE$ and the coefficient of determination $R^{2}$ between the model prediction and the observed street enclosure for the spatially random test set. Similarly, we train the street frontage model to minimise the categorical cross-entropy loss on the training set, using the ADAM optimiser (Kingma and Ba, 2014) with an initial learning rate of 0.001. We then report the cross-entropy loss and the accuracy, which is simply the number of correctly predicted frontage classes divided by the total number of samples. All the experiments are conducted with the Keras library (Chollet, 2015) using a Tensorflow (Abadi et al., 2015) back-end.
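For reference, the three reported metrics can be written out directly. The definitions below are the standard ones and the toy inputs are illustrative. The paper's own numbers were produced with Keras, not with this code.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observations and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def accuracy(labels_true, labels_pred):
    """Share of samples whose predicted class matches the observed class."""
    correct = sum(1 for t, p in zip(labels_true, labels_pred) if t == p)
    return correct / len(labels_true)

# Regression: always predicting the mean gives R^2 = 0.
mse = mean_squared_error([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
r2 = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
# Classification: three of four frontage classes predicted correctly.
acc = accuracy([0, 1, 2, 1], [0, 1, 1, 1])
```

An $R^{2}$ of zero corresponds to a model that does no better than predicting the mean, which gives a baseline for reading the regression results.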

The results in table 1 show the $losses$ and $accuracy$ of the three dimension reduction techniques when predicting street enclosure for a spatially random test set. The models with 64 components achieve around 60% accuracy, while the models with 4 components achieve between 41% and 51% accuracy. The results show that we can achieve similar levels of accuracy with $PCA_{lin}$ when compared to both $AE_{lin}$ and $AE_{non}$ . Similarly, the results in table 2 show the $losses$ and $accuracy$ of the three dimension reduction techniques when predicting street frontage quality for a spatially random test set. The results show that a model with more components achieves a higher accuracy (around $70\%$ with 64 components) than one with fewer, and that $PCA_{lin}$ achieves comparable accuracy to $AE_{lin}$ and $AE_{non}$ . These results suggest that the convolutional layers are possibly capturing some of the non-linear effects between the different image features in the data. As a result, a linear dimension reduction technique such as $PCA_{lin}$ is able to learn a compact representation of the latent variable $z$ that captures similar variance for the two predictive tasks when compared to the autoencoders, while retaining its interpretability.

4.2. Street network

4.2.1. Visualisation experiments

For the street network case study, the trained convolutional autoencoder learned a mapping from the space of street network images (256 x 256 x 1, or 65,536 dimensions) to a lower dimensional latent space (640 dimensions), which is then further summarised into a set of linearly uncorrelated variables by applying $PCA_{lin}$ . By plotting the street network images with the lowest to highest values of each component, we can start to interpret the learnt latent space. In figure 12, we show the first five components. These plots all relate to the density of streets in different spatial regions. The first $pca$ encodes general density, while $pca$ 2-5 encode spatialised densities (left-right, top-bottom, center-periphery and diagonals, respectively).

To make it easier to interpret each $pca$ , we create a mean vector $\hat{v}$ , where we keep all values in $\hat{v}$ constant and vary only the $pca$ before passing it to the decoder to create a synthetic image. In figure 13, we show a subset of the different latent visual components encoded by the $pca$ values. We show that the first 10 $pca$ encode regions of spatialised density. We confirm the clustered spatial structure of these components through spatial autocorrelation tests. The results of the tests are shown in the $appendix$ for the first 8 $pca$ perturbations. $pca$ 11-50 encode the global structure of the network (coarse-grained detail), while $pca$ 50-640 encode the local structure of the network (finer-grained detail).

By mapping the values of the principal components we can further test spatial patterns that they might encode. With just the first principal component of the latent space we are able to differentiate street network densities across the city of London. Figure 14 shows central London has higher street density than outer London.

4.2.2. Prediction experiment

Lastly, we test the ability of these encodings to capture network features by using them to predict two network statistics: intersection density and closeness centrality. To do so, we first select a number of cities from our dataset and retrieve their street network graphs $G=(V,E)$ . For each graph, we calculate the closeness centrality of its nodes $u\in V$ through:

$$C(u)=\frac{n-1}{\sum_{v\in V}^{n-1}d(v,u)}$$

where $d(v,u)$ is the shortest weighted path distance between $u$ and $v$ , and $n$ is the total number of nodes in the graph. We then create a continuous $1.5 \times 1.5$ km rectangular grid over each city graph. For each grid cell, we define intersection density as the number of nodes inside the cell divided by the surface area of the cell, and closeness centrality as the median of the node values within the cell.
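The two network statistics can be sketched with the standard library as follows. This is a minimal illustration that assumes a connected, undirected graph stored as a dictionary of weighted neighbours, with node coordinates in metres. The function names, the data layout and the choice of square kilometres as the area unit are assumptions, not details from the paper.

```python
import heapq
from statistics import median

def shortest_path_lengths(graph, source):
    """Dijkstra's algorithm: weighted distance from source to each node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist

def closeness_centrality(graph, u):
    """C(u) = (n - 1) / sum of shortest path distances from u.
    Assumes every node is reachable from u."""
    total = sum(shortest_path_lengths(graph, u).values())
    return (len(graph) - 1) / total if total > 0 else 0.0

def grid_statistics(graph, coords, cell_size_m=1500.0):
    """Intersection density (nodes per km^2) and median closeness
    centrality for every occupied grid cell."""
    cells = {}
    for node, (x, y) in coords.items():
        key = (int(x // cell_size_m), int(y // cell_size_m))
        cells.setdefault(key, []).append(closeness_centrality(graph, node))
    area_km2 = (cell_size_m / 1000.0) ** 2
    return {
        key: {"intersection_density": len(values) / area_km2,
              "median_closeness": median(values)}
        for key, values in cells.items()
    }

# Toy path graph 0 - 1 - 2 with unit edge weights.
graph = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0}}
coords = {0: (0.0, 0.0), 1: (10.0, 0.0), 2: (2000.0, 0.0)}
stats = grid_statistics(graph, coords)
```

Closeness is computed on the whole graph before being aggregated per cell, so a cell's median value depends on streets outside that cell, whereas intersection density depends only on the nodes inside it.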

The data is then split into a training (70%), validation (15%), and test (15%) set. We train an $mlp$ $F(\cdot)$ to predict both the intersection density and median closeness centrality for each grid cell for all our street network graphs from the visual latent components $v$ . The $mlp$ is defined here as a fully connected neural network with two hidden layers, and a dropout layer (0.2) before the final activation. We define five different models based on the number of components [4,8,16,32,64] using the three dimension reduction techniques described in the methods section.

The results in Table 3 show the $mse$ and $R^{2}$ for street intersection density using different numbers of $pca$ components. With just a few components, we are able to achieve an accuracy of $77\%$ with a $PCA_{lin}$ model for the spatially random test set on spatial features of the graph (intersection density), achieving slightly better results than both $AE_{lin}$ and $AE_{non}$ .

In the case of the median closeness centrality, shown in Table 4, we achieve an $R^{2}$ of $55\%$ with $PCA_{lin}$ , showing that we can achieve similar levels of accuracy to the other dimensionality reduction techniques. The difference in results between the intersection density and the median closeness centrality predictions is most likely due to the fact that, while intersection density is a local attribute and can thus be entirely captured through the local graph structure within the grid cell, closeness centrality depends on the global structure of the graph of the entire city. Despite this, the model is informative and is still able to capture a significant portion of the variance of the closeness value with few components of the local graph structure. Further research is needed to investigate how much of the global structure can be inferred from local attributes.

5. Discussion and Conclusion

We have presented a simple but novel unsupervised approach to extract and interrogate visual latent components from urban images. This exploratory research stands in contrast to previous work, which focused on supervised learning (Naik et al., 2014; Law et al., 2018; Gebru et al., 2017; Seresinhe et al., 2017). Through geographic mapping, generative visualisations, and prediction experiments, we were able to retrieve initial meaning from these visual latent components. With the increasing availability of large-scale unlabelled image data, research into automatically learning a compact representation from geographical data will become increasingly useful.

In the case of the street level images, by mapping the visual latent components and generating synthetic images by perturbing those components, we were able to discover descriptive meaning in the data; two of the primary components could serve as measures of street urbanity. We also found that the lower dimensional latent components are able to predict two generic urban characteristics, street enclosure and street frontage type. The predictive accuracy for street frontage type is not as high as that of a purely supervised approach (Law et al., 2018), but the results suggest that a useful and generalisable representation can be learnt for different tasks. Despite the positive results, further exploration is necessary. For example, research is needed to relate the principal components to human-labelled data describing the perception of street quality (Naik et al., 2014), which could validate the meaning of these components quantitatively. To confirm the usefulness of the representation learnt, further research is also needed to compare the visual latent components from unsupervised visual features with those from supervised visual features (i.e. the Places365 database (Zhou et al., 2014)). Future research is also needed on a) creating more realistic reconstructions using generative models such as $VAE$ or $GAN$, b) developing quantitative methods to systematically disentangle and control interpretable latent components, and c) designing experiments on semi-supervised learning and multi-task learning tasks.

In the case of the street networks, although the model is able to predict road network density and median closeness centrality, it fails to capture more complex street network features. We believe this is because the self-organized pattern of street networks is the result of both geometrical order/disorder and local rules of optimality. Through rasterising the street networks, the explicit topological data of the graph is lost, and the model is not able to recover this quality from the image alone. Future work can explore ways to incorporate topological properties of the networks into the model. Recent advances in graph neural networks provide promising directions that would allow both topological and geometric properties to be incorporated into the model. This would allow a richer representation of the street network, as both the local connectivity structure and its spatial embedding could be preserved. Despite its many limitations, there are benefits to such an approach, since traditional network measures can be computationally expensive. For example, betweenness centrality has a time complexity of $O(nm+n^{2}\log n)$, and many spectral properties require an eigenvalue decomposition of the graph Laplacian matrix, with a time complexity of $O(n^{3})$. A model that could approximate these parameters efficiently could prove useful for varied applications, such as characterizing street networks across the world.

An immediate implication of the study is that, by learning a useful and compact representation from urban images, we can use this information directly for other downstream geographical tasks such as prediction and classification. In turn, this can reduce compute time and data collection costs significantly. More importantly, though, the exploratory knowledge discovery process of combining visualisation and inference can shed new light on non-linear methods such as neural networks and on higher dimensional complex datasets such as images. To conclude, this research contributes to recent efforts in linking the disciplines of geography and machine learning. On the one hand, we find meaning from the visual latent components of street level and street network images. On the other hand, we also demonstrate how geographical datasets and visualisation techniques can be useful to enrich our understanding of machine learning methods.

6. Appendix

In the appendix, we describe for both the StreetView case study and the Street Network case study, the architecture of the Convolutional Autoencoder, the stacked autoencoders, the multi-layer-perceptron ( $mlp$ ) and the spatial autocorrelation analysis.

6.1. StreetView architecture

6.1.1. Convolutional Autoencoder architecture

For the Convolutional Autoencoder of StreetView, the input is a fixed-size $224\times 224$ three-channel colour image. We adopted simplified convolution blocks from $VGG$ (Simonyan and Zisserman, 2014) as the basis of the architecture, keeping the kernel size and filter numbers constant across both the encoder and decoder. Let $Ck$ denote a Convolution-ReLU layer with $k$ filters and $CDk$ denote a Convolution-ReLU-Upsample layer with $k$ filters. All convolutions are $3\times 3$ spatial filters applied with a stride of 1.

encoder: C64-C64-C128-C128-C256-C256-C512-C512-CC512-C512-C512-CC512

decoder: CD512-CD512-CD512-C512-C512-C512-C256-C256-C256-C128-C128-C64-C64

6.1.2. Stacked Autoencoder architecture

For the stacked Autoencoder, where we summarise the latent variable $z$ to its latent component $v$, we applied two forms of the autoencoder, namely a linear autoencoder and a nonlinear autoencoder. The linear autoencoder uses linear activation functions with one bottleneck layer that outputs $v$. The nonlinear autoencoder, on the other hand, uses the $ReLU$ activation function with three hidden layers, where the first and third hidden layers are the encoding and decoding layers with $512$ neurons each and the second layer is the bottleneck layer that outputs $v$. Let $Dk$ denote a Dense-ReLU layer with $k$ neurons and $N$ the number of components in the bottleneck layer.

linear: 4096-N-4096

non-linear: 4096-D512-N-D512-4096

6.1.3. Multi-layer Perceptron

The multi-layer perceptron $mlp$ here is defined as a fully connected neural network with three hidden layers. The first fully connected layer has 64 hidden nodes, the second has 32 hidden nodes, and the third has 16 hidden nodes. A dropout layer ($0.2$) and $l1$ regularisation were added at the final activation layer. We constructed five different models based on the number of components [4,8,16,32,64] for each of the three dimension reduction techniques, resulting in a total of 15 models. Let $Dk$ denote a Dense-ReLU layer with $k$ neurons and $N$ denote the size of the visual latent component, one of [4,8,16,32,64].

Multi-layer perceptron: N-D64-D32-D16-1
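To make the layer notation concrete, the sketch below expands a specification such as N-D64-D32-D16-1 into layer widths and counts the trainable weights for each choice of $N$. The helper names `layer_sizes` and `count_parameters` are hypothetical, and the count assumes every layer is fully connected with a bias term, which the text implies but does not state explicitly.

```python
def layer_sizes(spec, n_components):
    """Turn a spec like 'N-D64-D32-D16-1' into a list of layer widths."""
    sizes = []
    for token in spec.split("-"):
        token = token.strip()
        if token == "N":
            sizes.append(n_components)  # input width = number of latent components
        else:
            sizes.append(int(token.lstrip("D")))  # 'D64' -> 64, '1' -> 1
    return sizes

def count_parameters(sizes):
    """Weights plus biases of consecutive fully connected layers."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

for n in [4, 8, 16, 32, 64]:
    sizes = layer_sizes("N-D64-D32-D16-1", n)
    print(n, sizes, count_parameters(sizes))
```

The same helpers accept the other dense specifications listed in this appendix, for example "4096-D512-N-D512-4096" for the nonlinear stacked autoencoder.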

6.2. Street network architecture

6.2.1. Convolutional Autoencoder architecture

For the Convolutional Autoencoder of street network data, the input is a fixed-size $256\times 256$ single-channel gray-scale image. We use a stack of Convolution-ReLU layers and transposed convolutional layers, with a fixed small receptive field of $3\times 3$ and a convolution stride fixed to 2 pixels. Let $Ck$ denote a Convolution-ReLU layer with $k$ filters and $TCk$ denote a Transposed-Convolution-ReLU layer with $k$ filters.

encoder: C15-C15-C15-C10-C10

decoder: TC10-TC10-TC15-TC15-TC1
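The size of the encoder output can be checked with a short calculation. The sketch below assumes "same" padding, so that each stride-2 convolution halves the spatial size (rounding up); the padding scheme is not stated in the text, and `conv_output_size` is a hypothetical helper.

```python
import math

def conv_output_size(size, stride=2):
    """Spatial size after one convolution, assuming 'same' padding."""
    return math.ceil(size / stride)

size = 256                            # input is 256 x 256 x 1
encoder_filters = [15, 15, 15, 10, 10]
for k in encoder_filters:
    size = conv_output_size(size)     # 256 -> 128 -> 64 -> 32 -> 16 -> 8
    print(f"C{k}: {size} x {size} x {k}")

latent_dim = size * size * encoder_filters[-1]
print("flattened latent size:", latent_dim)
```

Under this assumption the flattened encoder output has 8 x 8 x 10 = 640 values, which is consistent with the 640-dimensional input listed for the stacked autoencoder in section 6.2.2.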

6.2.2. Stacked Autoencoder architecture

For the stacked Autoencoder, where we summarise the latent variable $z$ to its latent component $v$, we applied two forms of the autoencoder, namely a linear autoencoder and a nonlinear autoencoder. The linear autoencoder uses linear activation functions with one bottleneck layer that outputs $v$. The nonlinear autoencoder, on the other hand, uses the $ReLU$ activation function with three hidden layers, where the first and third hidden layers are the encoding and decoding layers with $128$ neurons each and the second layer is the bottleneck layer that outputs $v$. Let $Dk$ denote a Dense-ReLU layer with $k$ neurons and $N$ the number of components in the bottleneck layer.

linear: 640-N-640

non-linear: 640-D128-N-D128-640

6.2.3. Multi-layer Perceptron

The multi-layer perceptron $mlp$ here is defined as a fully connected neural network with two hidden layers. The first fully connected layer has 32 hidden nodes and the second has 16 hidden nodes. A dropout layer ($0.2$) was added before the final activation layer, and $l1$ regularisation was added at the final activation layer. We constructed five different models based on the number of components [4,8,16,32,64] for each of the three dimension reduction techniques. Let $Dk$ denote a Dense-ReLU layer with $k$ neurons and $N$ denote the size of the visual latent component, one of [4,8,16,32,64].

Multi-layer perceptron: N-D32-D16-1

6.3. Global Spatial Autocorrelation structure

6.3.1. Street network images

In the case of the rasterized street network data, we test whether the latent components capture strong local spatial inter-dependencies. This can be examined by measuring the autocorrelation of each pixel with its local neighbours when perturbing the principal components of an average image. For our purposes we assume that the output $I^{\prime}$ of our $ConvPCA$ follows some spatial process $y\sim f(c)$, where $y=vec(I^{\prime})$ and $c$ is a vector indexing the pixel values $y_{i}$ in the output $I^{\prime}$. The local spatial autocorrelation $L_{i}=L(y_{i})$ is computed as:

$L_{i}=(n-1)\frac{y_{i}-\bar{y}}{\sum_{j=1,j\neq i}^{n}(y_{j}-\bar{y})^{2}}\sum_{j=1,j\neq i}^{n}w_{i,j}(y_{j}-\bar{y})$

where $\bar{y}$ represents the mean of the $y_{i}$ and $w_{i,j}$ are components of a weight matrix indicating membership of the local neighbourhood set between pixels $i$ and $j$. Below, we show the results of the global spatial autocorrelation $L=\sum_{i}L_{i}$ of the maximum and minimum perturbations of the first 8 $pca$ values and their corresponding images $I^{\prime}$.
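The local statistic $L_{i}$ and its sum $L$ can be evaluated directly from the formula above, as in the following sketch. It uses a toy one-dimensional row of four pixels with binary, non-standardised weights between immediate neighbours; the weighting scheme, the helper names `chain_weights` and `local_autocorrelation`, and the data are assumptions made for illustration only.

```python
def chain_weights(n):
    """Binary weights linking each pixel to its immediate left/right neighbour."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
            for i in range(n)]

def local_autocorrelation(y, weights):
    """Evaluate L_i for every pixel, following the formula term by term."""
    n = len(y)
    mean = sum(y) / n
    dev = [value - mean for value in y]  # y_j - y_bar
    result = []
    for i in range(n):
        # Sum of squared deviations over all j != i.
        denom = sum(dev[j] ** 2 for j in range(n) if j != i)
        # Weighted sum of neighbouring deviations (spatial lag).
        lag = sum(weights[i][j] * dev[j] for j in range(n) if j != i)
        result.append((n - 1) * dev[i] / denom * lag)
    return result

y = [0.0, 0.0, 1.0, 1.0]  # two dark pixels next to two bright pixels
local = local_autocorrelation(y, chain_weights(len(y)))
global_value = sum(local)  # L = sum_i L_i
print(local, global_value)
```

In this toy case the two end pixels, whose only neighbour shares their value, contribute positively, while the two middle pixels, which sit between one like and one unlike neighbour, contribute zero.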

6.3.2. StreetView images

In the case of the street level images, we test whether the latent components exhibit strong geographical dependencies. This can be examined by measuring the spatial autocorrelation between a street's component value and those of its local neighbours, defined here as its eight nearest neighbours. The global Moran’s I $L$ is calculated for the two primary components. The result shows a strong spatial dependence of the visual latent component values at the street level.

Bibliography

Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke,
Batty (2005) Michael Batty. 2005. Cities and complexity: understanding cities through cellular automata, agent-based models and fractals.
Bengio et al. (2007) Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy Layer-Wise Training of Deep Networks. In Advances in Neural Information Processing Systems 19, B. Schölkopf, J. C. Platt, and T. Hoffman (Eds.). MIT Press, 153–160. http://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf
Boeing (2017) Geoff Boeing. 2017. OSMnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks. Computers, Environment and Urban Systems 65 (2017), 126–139.
Boeing (2018a) Geoff Boeing. 2018a. Measuring the Complexity of Urban Form and Design. October (2018), 1–22.
Boeing (2018b) Geoff Boeing. 2018b. A multi-scale analysis of 27,000 urban street networks: Every US city, town, urbanized area, and Zillow neighborhood. Environment and Planning B: Urban Analytics and City Science (2018), 2399808318784595.
Chollet (2015) François Chollet. 2015. keras. https://github.com/fchollet/keras.