Understanding urban landuse from the above and ground perspectives: a deep learning, multimodal solution

Shivangi Srivastava; John E. Vargas-Muñoz; Devis Tuia

arXiv:1905.01752·cs.CV·May 7, 2019


TL;DR

This paper presents a deep learning multimodal approach combining overhead and ground-based imagery to automate urban landuse mapping, improving accuracy and scalability for urban planning applications.

Contribution

The study introduces an end-to-end trainable multimodal CNN that integrates Google Maps and Street View images for landuse classification, demonstrating superior accuracy and generalization across cities.

Findings

01. Multimodal model outperforms single-modality methods in accuracy.

02. Model generalizes well to different cities beyond training area.

03. Approach is scalable using widely available data sources.

Abstract

Landuse characterization is important for urban planning. It is traditionally performed with field surveys or manual photo interpretation, two practices that are time-consuming and labor-intensive. Therefore, we aim to automate landuse mapping at the urban-object level with a deep learning approach based on data from multiple sources (or modalities). We consider two image modalities: overhead imagery from Google Maps and ensembles of ground-based pictures (side-views) per urban-object from Google Street View (GSV). These modalities bring complementary visual information pertaining to the urban-objects. We propose an end-to-end trainable model, which uses OpenStreetMap annotations as labels. The model can accommodate a variable number of GSV pictures for the ground-based branch and can also function in the absence of ground pictures at prediction time. We test the effectiveness of our…

Tables

Table 1: Accuracy scores for our proposed Multimodal CNN model and two unimodal CNN models for the Île-de-France dataset (OH: overhead imagery; GSV: Google Street View ground-based images; rGSV: GSV feature vectors retrieved through the CCA algorithm; OA: overall accuracy; AA: average accuracy).

	Data source(s)		Metric
Model Name	Train	Test	OA	AA
VGG16 [40]	OH	OH	67.48 $\pm$ 0.57	62.67 $\pm$ 1.39
VIS-CNN with Avg [34]	GSV	GSV	62.52 $\pm$ 1.12	60.24 $\pm$ 1.71
Multimodal CNN	OH, GSV	OH, GSV	73.44 $\pm$ 0.96	70.30 $\pm$ 2.59
Multimodal CNN + CCA	OH, GSV	OH, rGSV	71.78 $\pm$ 1.02	65.65 $\pm$ 1.71

Table 2: Accuracy scores for our proposed Multimodal CNN and VIS-CNN using different base models (ResNet50 and AlexNet instead of VGG16) for Île-de-France (OH: overhead imagery; GSV: Google Street View ground-based images; OA: overall accuracy; AA: average accuracy).

	Data source(s)		Metric
Model Name	Train	Test	OA	AA
AlexNet [49]	OH	OH	63.42 $\pm$ 1.35	57.45 $\pm$ 1.44
ResNet50 [51]	OH	OH	67.53 $\pm$ 1.07	64.18 $\pm$ 1.54
VIS-CNN with Avg, AlexNet	GSV	GSV	57.13 $\pm$ 1.18	54.10 $\pm$ 0.82
VIS-CNN with Avg, ResNet50	GSV	GSV	54.60 $\pm$ 2.62	54.95 $\pm$ 3.81
Multimodal CNN, AlexNet	OH, GSV	OH, GSV	69.21 $\pm$ 0.64	66.44 $\pm$ 0.92
Multimodal CNN, ResNet50	OH, GSV	OH, GSV	68.96 $\pm$ 0.89	67.25 $\pm$ 1.44

Table 3: Accuracy scores of the proposed Multimodal CNN model and two unimodal CNN models for the city of Nantes.

Base Model	Data Source (Test)	OA	AA
VGG16	OH	70.94 $\pm$ 0.44	53.9 $\pm$ 1.13
VIS-CNN with Avg	GSV	58.54 $\pm$ 0.72	52.11 $\pm$ 0.80
Multimodal CNN	OH, GSV	75.07 $\pm$ 1.10	62.91 $\pm$ 0.75

Equations

g(u)_{max}^{j}=\max_{i=1,\ldots,N_{u}} f(\mathbf{x}_{u}^{i})^{j} \qquad (1)

g(u)_{avg}^{j}=\frac{1}{N_{u}}\sum_{i=1}^{N_{u}} f(\mathbf{x}_{u}^{i})^{j} \qquad (2)

L=\frac{1}{N}\sum_{u=1}^{N}\Biggl[-\sigma(\widehat{l_{u}}=l_{u}|\mathbf{x}_{u}^{1},\ldots,\mathbf{x}_{u}^{N_{u}},\mathbf{o}_{u})+\log\biggl(\sum_{k=1}^{K}\exp(\sigma(\widehat{l_{u}}=k|\mathbf{x}_{u}^{1},\ldots,\mathbf{x}_{u}^{N_{u}},\mathbf{o}_{u}))\biggr)\Biggr] \qquad (3)

\min_{W_{1},W_{2},W_{3}}\sum_{i,j=1}^{3}\|X_{i}W_{i}-X_{j}W_{j}\|_{F}^{2} \quad \text{subject to} \quad W_{i}^{T}\Sigma_{ii}W_{i}=I,\; w_{ik}^{T}\Sigma_{ij}w_{jl}=0, \quad i,j=1,2,3,\; i\neq j, \quad k,l=1,\dots,d,\; k\neq l \qquad (4)

\begin{pmatrix}C_{11}&C_{12}&C_{13}\\C_{21}&C_{22}&C_{23}\\C_{31}&C_{32}&C_{33}\end{pmatrix}\begin{pmatrix}w_{1}\\w_{2}\\w_{3}\end{pmatrix}=\lambda\begin{pmatrix}C_{11}&0&0\\0&C_{22}&0\\0&0&C_{33}\end{pmatrix}\begin{pmatrix}w_{1}\\w_{2}\\w_{3}\end{pmatrix} \qquad (5)

\mathrm{sim}(X_{1},X_{2}^{*})=\frac{(X_{1}W_{1}D_{1})(X_{2}^{*}W_{2}D_{2})^{T}}{\|X_{1}W_{1}D_{1}\|_{2}\,\|X_{2}^{*}W_{2}D_{2}\|_{2}} \qquad (6)


Full text


Shivangi Srivastava

John E. Vargas-Muñoz

Devis Tuia

Laboratory of Geo-information Science and Remote Sensing, Wageningen University & Research, the Netherlands

Laboratory of Image Data Science, Institute of Computing, University of Campinas, Campinas, Brazil

Abstract

This is the pre-acceptance version. To read the final version published in the journal Remote Sensing of Environment, please go to: https://doi.org/10.1016/j.rse.2019.04.014

Landuse characterization is important for urban planning. It is traditionally performed with field surveys or manual photo interpretation, two practices that are time-consuming and labor-intensive. Therefore, we aim to automate landuse mapping at the urban-object level with a deep learning approach based on data from multiple sources (or modalities). We consider two image modalities: overhead imagery from Google Maps and ensembles of ground-based pictures (side-views) per urban-object from Google Street View (GSV). These modalities bring complementary visual information pertaining to the urban-objects. We propose an end-to-end trainable model, which uses OpenStreetMap annotations as labels. The model can accommodate a variable number of GSV pictures for the ground-based branch and can also function in the absence of ground pictures at prediction time. We test the effectiveness of our model over the area of Île-de-France, France, and test its generalization abilities on a set of urban-objects from the city of Nantes, France. Our proposed multimodal Convolutional Neural Network achieves considerably higher accuracies than methods that use a single image modality, making it suitable for automatic landuse map updates. Additionally, our approach could be easily scaled to multiple cities, because it is based on data sources available for many cities worldwide.

keywords:

Landuse characterization, convolutional neural networks, overhead imagery, ground-based pictures, volunteered geographic information, urban areas, multi-modal, canonical correlation analysis, missing modality

Journal: Remote Sensing of Environment

1 Introduction and Related Work

According to the UN report “The World’s Cities in 2016” (http://www.un.org/en/development/desa/population/publications/pdf/urbanization/the_worlds_cities_in_2016_data_booklet.pdf), the population living in urban areas will rise from 4 billion in 2016 to a projected 5 billion in 2030. Therefore, it becomes important to gather information about how land is being utilized in urban areas. This information provides insights to city planners, helping them manage current urban infrastructure as well as plan for future cities. In this paper, landuse is defined as the utility of a particular area for humans: for example, an area could be used as a school, a park, a museum or a hospital. The mapping of various landuses is traditionally done through field surveys, which are often time-consuming, expensive and labor-intensive to carry out. This makes it impractical to update these maps frequently. Therefore, it is imperative to design models capable of automating the generation of landuse maps using data-driven approaches.

In the last decade, great advances have been observed in the automation of landcover mapping using remote sensing imagery [1, 2, 3], and current large-scale efforts extend this logic to multiple cities worldwide [4, 5]. Landcover mapping considers the characterization of various materials visible on the Earth’s surface, for example, crops, orchards, forests, water bodies, roads or buildings. Earlier solutions to the problem classified each pixel based solely on its spectral signature [6], since this information is correlated with the underlying material. In cases where the spectral information was not sufficient to discriminate between landcover classes, contextual and texture information [7] was integrated, usually by analyzing a fixed-size window around each pixel. Later, unsupervised segmentation methods were widely used to partition the image and perform object-based classification, making it possible to extract more discriminative features as well as contextual information from neighboring regions [8, 9]. More recently, Convolutional Neural Networks (CNN) have attained more accurate classification results [10]. CNNs learn, in a supervised way, a hierarchy of filters that extract high-level features using both spectral and spatial information. They have been used to perform classification in a patch-based way [11, 12, 13] and also to classify all the pixels of the input image in one forward pass [14, 15].

Following a similar approach, based on overhead images only, to generate accurate large-scale landuse maps is not an easy task, because the spectral signature of materials alone is not sufficient for discerning different landuse types. (Footnote 2: We define landuse as the way in which a delimited geographical space is utilized by humans. For example, this might be a hospital, a school, a museum, a park, etc.) The problem is two-fold: 1) most of the time, a landuse class is made of a combination of different landcover types. For example, a university could have buildings, trees, grass, water bodies and roads on its premises. 2) The same landcover types are observed across multiple landuse classes. For example, when seen from above, buildings with similar architecture could be a government office or a school (see Figure 1).

Therefore, generating an accurate landuse map at the urban-object level from overhead imagery alone is a challenging task. (Footnote 3: We define an urban-object as a spatial construct in an urban space with a clear physical boundary of its own, which could be a closed construct (like a shop or an office), a semi-open construct (like a stadium), or an open space (e.g. a natural forest or a man-made park).) Still, some work has been done in this direction, typically following a patch-based classification scheme [16, 17, 18, 19, 20] or hybrid approaches that involve patch- and object-based analysis [21]. A typical pattern in these studies is the search for more representative feature spaces to describe landuse, for instance using textures and context [22, 23] or higher order information [24, 25]. The assumption is that, when seen from the top, different landuse types show different structural characteristics. Some recent works also explored the use of data from other sources, such as road networks or OpenStreetMap (OSM; https://www.openstreetmap.org/) vector data [26]. The assumption in these cases is that the remotely sensed information alone is insufficient to describe landuse, and that the incorporation of complementary, meaningful data sources is beneficial.

In parallel, researchers have also approached the landuse mapping problem from the ground perspective, typically by using other data sources such as ground-based pictures from online repositories (e.g. Flickr, Instagram, Geograph) [27, 28, 29, 30]. The ground-based viewpoint of these pictures provides crucial information on the function of urban-objects that is conventionally hidden from the view above, such as school entrances. However, the pictures from these repositories also have shortcomings: 1) they are often not accurately geo-referenced; 2) they sometimes depict highly personalized content (mostly touristic viewpoints, selfies or zoomed objects) rather than visual information about the urban-object; 3) they tend to cover the city unevenly (most pictures are geo-located in touristic areas). These problems make such picture databases less suitable for our purpose, i.e., reliable landuse mapping of a city. Nonetheless, thanks to the availability of services like Google Street View (GSV; https://developers.google.com/maps/documentation/streetview/), it is nowadays possible to obtain ground-based pictures with objective content for many urban-objects. These pictures are accurately geo-located and densely available across many cities worldwide. GSV pictures are also updated regularly, and it is possible to access historical data. They have proven to be beneficial for complex tasks such as urban tree detection [31] or detection of urban fabric changes [32]. For a review of recent papers dealing with aerial to ground fusion tasks, please refer to [33].

GSV is also being increasingly used in landuse classification [34, 35, 36, 37]. Authors in [35] used a deep Convolutional Neural Network (CNN) to perform store front classification in 13 business categories from single GSV pictures. Authors in [36] classify the landuse of urban-objects into 8 classes by using GSV pictures and labels from OSM. The model predicts one label for each picture in the set of GSV pictures corresponding to one urban-object. The final predicted label corresponds to the class with the maximum average classification score. This last strategy might be suboptimal for our case: since the model learns the landuse of an urban-object from pictures considered independently, it will force images with similar typical objects (e.g. pictures with trees) to be classified into different landuse classes. This makes training unnecessarily difficult and leaves the final decision to the majority vote, which can succeed only under a strong assumption: that each urban-object of a class will be imaged mostly with pictures containing objects that are both typical and unique for that specific class. Instead, we argue that each landuse category is made of different objects present in a set of images: in our previous work [34], we proposed a model that learns class representations from ensembles of GSV pictures. In this paper, we extend it to a multi-modal strategy, leveraging the complementarity of aerial and terrestrial views.

Landuse mapping using both terrestrial pictures and remote sensing data is a new and emerging field: to the best of our knowledge, the only paper dealing with it explicitly is [37] over New York City, by means of landuse labels provided by the New York City Department of City Planning. Using footprints and labels from authoritative sources makes the method less scalable to cities where such building footprints (and their landuse labels) could be sparse, of insufficient quality, or based on landuse definitions that vary strongly across cities. Another important difference is that their proposed model performed per-pixel classification. The feature representation of each pixel was obtained using a fixed number $N_{loc}$ of nearby locations where street level panoramas were available. For each of these $N_{loc}$ locations, GSV pictures looking in the four cardinal directions were used. A drawback of this approach is that pictures taken in such a way provide features that may depict objects unrelated to the landuse observed at the pixel level.

In this paper, we learn a multimodal model leveraging visual information from both aerial and ground views to predict landuse at an urban-object level. Looking at the growing success of deep learning algorithms in remote sensing [10], we propose a model that combines visual information of overhead imagery and ground-based pictures associated with the urban-objects and trains end-to-end. The urban-object footprints and the ground truth labels are collected from OSM. We study the effectiveness of the proposed model on a case study in the region of Île-de-France (France). Our proposed model outperforms architectures based on unimodal data. This shows the importance and complementarity of both the data sources. For most landuse categories, the proposed multimodal model obtains accuracies above 70%.

Since GSV images are not always available or can be of insufficient quality (for instance because of positioning errors or occlusions), we also propose a module able to process urban-objects for which the GSV images are missing. It uses a joint three-view embedding space that projects into a common representation the deep features obtained for the two modalities (a set of GSV pictures and the overhead imagery of the same urban-object) together with the landuse category of each urban-object. This embedding space is useful since it allows cross-modality retrieval: by looking for nearest neighbors, the system is able to retrieve from the training set the most likely GSV feature vector for the urban-object and use it for prediction.

By combining standard deep learning building blocks in a new efficient way and using solely widely available data, our model can be easily deployed and also be transferred to new urban environments, where OSM annotations are available. The main contributions of the work are:

- The development of a deep learning system based on widely available data to describe landuse classes at the urban-object level;

- The design of a system that accepts a variable number of street-level images to describe appearance from multiple points of view;

- The addition of an embedding module making the system robust to the lack of ground-based pictures for an urban-object at test time. In that case, an alternative ensemble of plausible GSV pictures from the training set is retrieved and used together with the overhead imagery to predict the landuse class accurately.

The paper is organized as follows: In Section 2 we present the proposed model in detail. Section 3 brings forward how the dataset was created for the region of Île-de-France. Section 4 shows the experimental setup while results are discussed in Section 5. Section 6 concludes the paper.

2 Methods

In this paper, we define landuse classification as the task of predicting a class label $l_{u}\in\{1,\ldots,K\}$ of a given urban-object $u$, where $K$ is the number of landuse classes. In our case, each urban-object is defined by a polygon footprint obtained from OSM (see Section 3), along with its label (also from OSM). In order to predict the category of the urban-object $u$, we have a collection of $N_{u}$ ground-based pictures $\{\mathbf{x}_{u}^{i}\}_{i=1}^{N_{u}}$ and one overhead image $\mathbf{o}_{u}$ of this urban-object. The procedure to collect this dataset is discussed in Section 3.

Our proposed Convolutional Neural Network model is composed of two streams, the ‘Overhead Imagery Stream’ and the ‘Ground-based Pictures Stream’ (see Figure 4), which extract discriminative features from overhead imagery and ground-based pictures, respectively. The features learned for the two streams are then combined to perform the prediction of the final landuse category. Note that we are not aiming at performing semantic segmentation at the pixel level; our objective is rather to predict the landuse category of the urban-objects, which are vectorial objects in OpenStreetMap. In Sections 2.1 and 2.2, we describe the two CNN models that are used with either modality (these unimodal CNN models are also our baselines for comparison). In Section 2.3, we show how our proposed model combines the two streams to perform landuse classification. In Section 2.4, we discuss how to use a projective method based on canonical correlations to cope with situations where the GSV modality is not available at test time.

2.1 CNN Architecture for Overhead Imagery

This first baseline accepts remote sensing imagery and is thus related to traditional patch-based remote sensing image classification methods (e.g. [38]). For every OSM footprint, we use an overhead image crop that covers it completely. Figure 2 depicts our corresponding CNN architecture. The overhead imagery is used as an input for a sequence of convolutional blocks (violet part in Figure 2), with each block encompassing a convolution operation followed by spatial pooling and a non-linear activation function (Rectified Linear Unit; ReLU) that outputs an activation map. Then, a fully connected layer converts the activation map into a high-dimensional feature vector (in green). Another fully connected layer is then applied that projects the feature vector into class scores; these are eventually normalized to $[0,1]$ by means of a softmax operation. The category with the maximum score is considered as the final predicted class. Several works [16, 17] have shown good landuse classification performance by fine-tuning CNN models that were trained on large datasets for object recognition (i.e., ImageNet [39]). Similarly, we used the popular VGG16 architecture [40] pretrained on ImageNet as a base trunk to extract features (in violet in Figure 2).

2.2 Siamese-like Architecture for Ground Based Pictures

Urban-objects are generally surrounded by roads, which allows us to associate multiple GSV pictures to them. This means that for such an OSM footprint, we get discriminative and complementary representations thanks to GSV pictures capturing its object from different points of view. In our previous work [34] we exploited this observation and proposed the Variable Input Siamese Convolutional Neural Network (VIS-CNN). This model learns a single feature representation of an arbitrary number of GSV pictures for a given urban-object in an end-to-end manner.

Figure 3 depicts the VIS-CNN model for landuse classification using ground-based pictures. First, the convolutional blocks and the fully connected layers extract the feature vectors for each image. Note that the same CNN model (VGG16 [40], pre-trained on the ImageNet dataset) is used for each image to extract these features. Afterwards, the $N_{u}$ feature vectors $\mathbf{f}(\mathbf{x}_{u}^{i})$, one per picture $i$ pertaining to urban-object $u$, are aggregated to obtain a single feature descriptor of the urban-object $u$. In [34] we compared aggregation strategies based on average and max pooling:

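$$g(u)_{max}^{j}=\max_{i=1,\ldots,N_{u}} f(\mathbf{x}_{u}^{i})^{j} \qquad (1)$$

$$g(u)_{avg}^{j}=\frac{1}{N_{u}}\sum_{i=1}^{N_{u}} f(\mathbf{x}_{u}^{i})^{j} \qquad (2)$$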

where $f(\cdot)^{j}$ is the $j^{th}$ element of the vector $\mathbf{f}(\cdot)$. The max operator performs input selection, picking the most important representation among all the pictures for each element of the feature vector. The avg aggregator assigns importance to the most repeated attributes among all the pictures associated with the urban-object. Experimentally, we observed that the avg aggregator performs better than the max [34]; thus we use the avg aggregator in the experiments below. Interestingly, this is also in line with very recent results obtained in the field of image deblurring from image sequences [41], where the authors proposed an architecture very similar to ours to cope with the variable length of the sequence.

Finally, the computed aggregated vector $\mathbf{g}(u)$ is used as input of the last fully connected layer (classifier), that outputs the classification scores for each category to obtain the final prediction.
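As an illustration of the two aggregators, the following minimal sketch pools an arbitrary number of per-picture feature vectors into one descriptor. The function names and the toy 4-dimensional vectors are hypothetical and only stand in for the features produced by the CNN trunk; this is not the authors' implementation.

```python
from typing import List, Sequence


def aggregate_avg(features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean over the N_u per-picture feature vectors."""
    if not features:
        raise ValueError("at least one feature vector is required")
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def aggregate_max(features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over the N_u per-picture feature vectors."""
    if not features:
        raise ValueError("at least one feature vector is required")
    return [max(column) for column in zip(*features)]


# Three hypothetical 4-dimensional descriptors for one urban-object.
feats = [
    [0.0, 2.0, 1.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [0.0, 2.0, 4.0, 2.0],
]
g_avg = aggregate_avg(feats)  # one descriptor, same length as each input
g_max = aggregate_max(feats)
```

Because both functions reduce over the picture axis, the output length does not depend on how many pictures are available for an urban-object, which is what lets the same classifier handle a variable number of inputs.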

2.3 Multimodal CNN Architecture

The two models described in the previous sections have very similar bottlenecks, both corresponding to a d-dimensional fully connected layer. In this section, we take advantage of this similarity in order to perform late representation fusion.

Figure 4 depicts the proposed CNN model for multimodal landuse classification. For every urban-object $u$ we use its corresponding set of $N_{u}$ ground-based pictures $\{\mathbf{x}_{u}^{i}\}_{i=1}^{N_{u}}$ (used as inputs for the model described in Section 2.2), as well as its corresponding overhead imagery $\mathbf{o}_{u}$ (used as input of the model described in Section 2.1). In both cases, we stop at the level of feature extraction, i.e. we remove the classifiers in the architectures illustrated in Figures 2 and 3 and only keep the convolutional blocks for feature extraction. Then, the image features are combined by a fully connected layer that outputs a score for each landuse category. After that, a softmax layer is applied to obtain normalized classification scores, as for the previous models.
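The late fusion step can be pictured with the sketch below. It assumes that the two feature vectors are concatenated before the fully connected layer; the text only states that a fully connected layer combines them, so the concatenation, the function names and the tiny hand-written weights are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(overhead_feat, ground_feat, weights, bias):
    """Concatenate both descriptors (assumption) and apply one linear layer."""
    fused = list(overhead_feat) + list(ground_feat)
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(scores)


# Hypothetical 2-dimensional descriptors and a 3-class linear layer.
overhead = [1.0, 0.0]
ground = [0.0, 2.0]
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
bias = [0.0, 0.0, 0.0]
probs = fuse_and_classify(overhead, ground, weights, bias)
predicted = max(range(len(probs)), key=probs.__getitem__)
```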

In order to learn the parameters of the CNN model, we use the cross-entropy loss function:

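$$L=\frac{1}{N}\sum_{u=1}^{N}\Biggl[-\sigma(\widehat{l_{u}}=l_{u}|\mathbf{x}_{u}^{1},\ldots,\mathbf{x}_{u}^{N_{u}},\mathbf{o}_{u})+\log\biggl(\sum_{k=1}^{K}\exp(\sigma(\widehat{l_{u}}=k|\mathbf{x}_{u}^{1},\ldots,\mathbf{x}_{u}^{N_{u}},\mathbf{o}_{u}))\biggr)\Biggr] \qquad (3)$$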

where $\sigma(\widehat{l_{u}}=k|\mathbf{x}_{u}^{1},\ldots,\mathbf{x}_{u}^{N_{u}},\mathbf{o}_{u})$ is the softmax score given by the model for the urban-object $u$ and class $k$ .
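The loss has the usual "score of the true class plus log-sum-exp over all classes" form. The sketch below evaluates it for a small batch, taking a vector of per-class scores for each urban-object as input. Which layer output plays the role of these scores is an assumption of this example, and labels are 0-indexed here whereas the text counts classes from 1.

```python
import math
from typing import Sequence


def log_sum_exp(scores: Sequence[float]) -> float:
    """log(sum_k exp(s_k)), shifted by the maximum for numerical stability."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def cross_entropy(batch_scores: Sequence[Sequence[float]],
                  labels: Sequence[int]) -> float:
    """Mean over the batch of -s[label] + log(sum_k exp(s_k))."""
    if len(batch_scores) != len(labels):
        raise ValueError("one label per urban-object is required")
    total = 0.0
    for scores, label in zip(batch_scores, labels):
        total += -scores[label] + log_sum_exp(scores)
    return total / len(labels)


# Two hypothetical urban-objects scored over three classes.
batch_scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [0, 2]
loss = cross_entropy(batch_scores, labels)
```

With identical scores for all classes the value reduces to $\log K$, which is a convenient sanity check when wiring up training.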

2.4 Missing Modality Retrieval with Three-View CCA

In this section, we present a solution to cope with urban-objects for which no street level picture is available at test time. We limit the analysis to this case, as a situation with missing overhead imagery is less likely to happen. However, the approach is general and could be applied to such a scenario as well. We propose to compensate for the missing modality by retrieving, for the queried test overhead imagery feature vector, the closest GSV feature vector in the training set. The GSV pictures corresponding to the retrieved feature vector and the overhead imagery of the urban-object are then used as input to the proposed multimodal model (see Section 2.3). The missing GSV modality retrieval task can be broadly divided into three steps (also illustrated in Figure 5):

1. Define the projection matrices for the joint embedding space by using the features extracted by the two CNN models (see Sections 2.1 and 2.2) on the training set.

2. Use these matrices to project the overhead CNN features of the test sample into the same embedding space.

3. Given the projected overhead features, find the nearest projected GSV feature from the training set. This in turn gives the nearest urban-object from the training set, which we consider a proxy of what the urban-object would have looked like in GSV pictures. Once found, use the GSV pictures of this nearest-neighbor urban-object in the multimodal model.

To define the joint embedding space, we exploit the fact that we have paired views of ensemble of GSV pictures and one overhead imagery for each urban-object in the training set, along with its landuse class. Under this assumption, we can define a space where two views (features from set of GSV pictures and top-view imagery) for an urban-object are projected close to each other and far from those of urban-objects belonging to different classes. This is possible because we are using class information that allows samples of the same class to be projected closer than samples coming from other land use classes (a typical assumption in this type of projective methods [42, 43]). To this end, we use a projective technique based on Canonical Correlation Analysis (CCA [44, 45]).

We have three datasets: $\hat{X}_{1}$ and $\hat{X}_{2}$ are the features issued from the two views (GSV and overhead imagery), while $X_{3}$ corresponds to the class labels. Each row of $\hat{X}_{1}$, $\hat{X}_{2}$ and $X_{3}$ represents a feature vector coming from one of the three modalities, all describing the same object. Originally, the dimensions of the three datasets are $(N\times 4096)$ for $\hat{X}_{1}$, $(N\times 4096)$ for $\hat{X}_{2}$, and $(N\times 16)$ for $X_{3}$ (the sixteen class labels are encoded as a sixteen-dimensional one-hot vector, with $1$ for the correct class and $0$ otherwise). To decrease the size of the matrices involved in the eigenvalue decomposition problem of CCA, a Principal Component Analysis (PCA) is applied to matrices $\hat{X}_{1}$ and $\hat{X}_{2}$ separately. This is a common practice in nonlinear dimensionality reduction, since embedding high-dimensional spaces is very difficult because of the curse of dimensionality and the noise in high-dimensional data [46]. In the following, we refer to the matrices obtained after PCA reduction as $X_{1}$ with size $(N\times d_{1})$ and $X_{2}$ with size $(N\times d_{2})$, where $d_{1},d_{2}<4096$.

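CCA finds projection matrices $W_{i}$ (one per view, $i=1,2,3$) that project the features $X_{i}$ from the view-specific spaces into a low-dimensional common embedding space, in which the distances between different views of the same data item are minimized (Equation (4)). The objective function for this problem can be written as:

$$\min_{W_{1},W_{2},W_{3}}\sum_{i,j=1}^{3}\|X_{i}W_{i}-X_{j}W_{j}\|_{F}^{2} \qquad (4)$$

$$\text{subject to}\quad W_{i}^{T}\Sigma_{ii}W_{i}=I,\quad w_{ik}^{T}\Sigma_{ij}w_{jl}=0,\quad i,j=1,2,3,\; i\neq j,\quad k,l=1,\dots,d,\; k\neq l$$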

where $\Sigma_{ii}$ is the covariance matrix of $X_{i}$ and $w_{ik}$ is the $k^{th}$ column of $W_{i}$. This problem can be solved as the generalized eigenvalue problem of Equation (5) (see [47] for details):

$$\begin{pmatrix}C_{11}&C_{12}&C_{13}\\C_{21}&C_{22}&C_{23}\\C_{31}&C_{32}&C_{33}\end{pmatrix}\begin{pmatrix}w_{1}\\w_{2}\\w_{3}\end{pmatrix}=\lambda\begin{pmatrix}C_{11}&0&0\\0&C_{22}&0\\0&0&C_{33}\end{pmatrix}\begin{pmatrix}w_{1}\\w_{2}\\w_{3}\end{pmatrix} \qquad (5)$$

where $C_{ij}=X_{i}^{T}X_{j}$ is the covariance matrix between the $i^{th}$ and $j^{th}$ views, $w_{i}$ is a column of $W_{i}$ and $\lambda$ is the associated eigenvalue. The size of this problem (corresponding to the maximal size of the embedding space) is $(d_{1}+d_{2}+d_{3})\times(d_{1}+d_{2}+d_{3})$, where $d_{i}$ is the dimensionality of the respective input data spaces (in our case, $4096$ for the CNN trained on GSV, $4096$ for the CNN trained on the overhead images and $16$ for the classes term). Also, a regularization parameter $\eta=10^{-4}$ is added to the diagonal of the covariance matrix $C_{ij}$ to better condition the problem.
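To make the structure of the eigenproblem concrete, here is a small NumPy sketch of a three-view CCA solver on random toy data. Several choices are assumptions of this example rather than details given in the text: each view is mean-centred, covariances are left unnormalised, $\eta$ is added to the diagonal of both sides, the generalized problem is reduced to a symmetric one through a Cholesky factor of the block-diagonal matrix, and the PCA step is skipped because the toy views are already low-dimensional. The function name `three_view_cca` is hypothetical.

```python
import numpy as np


def three_view_cca(views, eta=1e-4, d_emb=2):
    """Solve C w = lambda D w and return per-view projections and eigenvalues."""
    centred = [v - v.mean(axis=0) for v in views]  # assumption: centre each view
    dims = [v.shape[1] for v in centred]
    bounds = np.cumsum([0] + dims)
    stacked = np.hstack(centred)
    C = stacked.T @ stacked                        # all blocks C_ij = X_i^T X_j
    D = np.zeros_like(C)                           # block-diagonal part of C
    for start, stop in zip(bounds[:-1], bounds[1:]):
        D[start:stop, start:stop] = C[start:stop, start:stop]
    ridge = eta * np.eye(C.shape[0])               # assumption: same ridge on both sides
    C, D = C + ridge, D + ridge
    chol = np.linalg.cholesky(D)                   # D = chol @ chol.T
    chol_inv = np.linalg.inv(chol)
    sym = chol_inv @ C @ chol_inv.T                # symmetric standard eigenproblem
    eigvals, eigvecs = np.linalg.eigh(sym)
    top = np.argsort(eigvals)[::-1][:d_emb]        # keep the leading d_emb directions
    W = chol_inv.T @ eigvecs[:, top]               # map back to generalized eigenvectors
    projections = [W[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    return projections, eigvals[top]


# Toy data: two noisy views of a shared 2-D latent variable plus one-hot labels.
rng = np.random.default_rng(0)
N = 200
z = rng.normal(size=(N, 2))
X1 = z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(N, 5))
X2 = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(N, 4))
labels = (z[:, 0] > 0).astype(int)
X3 = np.eye(2)[labels]                             # one-hot class view
(W1, W2, W3), lams = three_view_cca([X1, X2, X3], d_emb=2)
```

Each returned matrix has one row per input dimension of its view and `d_emb` columns, so a centred feature matrix can be projected with a single matrix product such as `X2c @ W2`.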

Once the projection matrices $W_{i}$ are learned (using the training set) by solving Equation (5), we can use them to project new, unseen test data into the latent space and assess their relative position with respect to samples from the training data (Step 2 in Figure 5). In our case, we want to project CNN features from the overhead view of the test urban-object into the joint embedding space, in order to retrieve the closest GSV feature vector. Usually, only the first few dimensions of the projected space are relevant for expressing correlations across views [45]. Hence it is a common practice to use only the top eigenvectors to define the projection matrices. In order to do this, we keep the top $d_{emb}\ll d_{1}+d_{2}+d_{3}$ eigenvectors as projection matrices $W_{1}$, $W_{2}$ and $W_{3}$. After this selection, the projection matrices have dimensionality $W_{1}\in\mathbb{R}^{d_{1}\times d_{emb}}$, $W_{2}\in\mathbb{R}^{d_{2}\times d_{emb}}$ and $W_{3}\in\mathbb{R}^{d_{3}\times d_{emb}}$.

After projection, we can assess similarities between the projected vectors ( $X_{2}^{*}W_{2}$ ) of overhead data in test set ( $X_{2}^{*}$ ) and those coming from GSV in training set ( $X_{1}W_{1}$ ). To do so, we use the similarity function used in [47] as it leads to greater retrieval accuracy compared to that using Euclidean distance:

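$$\mathrm{sim}(X_{1},X_{2}^{*})=\frac{(X_{1}W_{1}D_{1})(X_{2}^{*}W_{2}D_{2})^{T}}{\|X_{1}W_{1}D_{1}\|_{2}\,\|X_{2}^{*}W_{2}D_{2}\|_{2}} \qquad (6)$$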

where $W_{i}$ is the projection matrix and $D_{i}$ is a diagonal matrix containing $d_{emb}$ eigenvalues, with each entry raised to the power $p$ [47, 48]. Now, for any projected overhead imagery feature in the test set, we can query the closest projected GSV feature in the training set, i.e. the one that maximizes the similarity of Equation (6). The GSV pictures from the urban-object (corresponding to the resulting nearest GSV feature) together with the overhead imagery are used as input to the proposed multi-modal model (Figure 4). This way, we obtain the final label prediction as presented in Section 2.3.
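The retrieval step can be sketched as a nearest-neighbor search under an eigenvalue-weighted cosine similarity. The example below works on vectors that are already projected into the embedding space. It assumes the same eigenvalues scale both views, and the value of `p`, the function names and the toy vectors are hypothetical, since the text does not specify them.

```python
import math
from typing import List, Sequence


def scale(vec: Sequence[float], eigvals: Sequence[float], p: float) -> List[float]:
    """Weight each embedding dimension by its eigenvalue raised to the power p."""
    return [v * (lam ** p) for v, lam in zip(vec, eigvals)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_nearest(query_proj, train_projs, eigvals, p=1.0) -> int:
    """Index of the training GSV embedding most similar to the overhead query."""
    q = scale(query_proj, eigvals, p)
    sims = [cosine(scale(t, eigvals, p), q) for t in train_projs]
    return max(range(len(sims)), key=sims.__getitem__)


# Hypothetical 2-D embeddings: three training GSV vectors, one overhead query.
train_gsv = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_overhead = [0.9, 0.1]
eigenvalues = [2.0, 1.0]
best = retrieve_nearest(query_overhead, train_gsv, eigenvalues, p=1.0)
```

The returned index identifies the training urban-object whose GSV pictures would then be fed, together with the test overhead image, to the multimodal model.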

3 Dataset

In order to evaluate our proposed method, we collected data from OSM, Google Maps and GSV in the region of Île-de-France, France. For this study we considered the metropolitan area of Paris and the nearby suburbs including Versailles, Orsay, Orly, Aulnay-sous-Bois, Le Bourget, Sarcelles, Chatou and Nanterre. For the supervised training stage of our multimodal CNN, we created an annotated dataset, which is made of an ensemble of side-view pictures and one overhead image view per urban-object with their corresponding landuse ground truth. The data collection procedure is detailed in the following subsections. Additionally, and in order to evaluate the generalization ability of the model trained with Île-de-France data, we have also gathered data and evaluated our method over the city of Nantes.

3.1 Annotations from OSM

We use OSM to obtain a collection of urban-objects with associated landuse categories. We group OSM landuse categories into 16 classes based on the similarity of their “usage” (for example, “lycée” and “école” are merged into a single class, “educational”, while synagogues and churches are merged into the class “religious”). Rarely appearing landuse classes like “crematorium” or “observatory” are not considered due to the limited amount of OSM footprints or of the corresponding GSV pictures. The selected 16 landuse classes are: “educational”, “hospital”, “religious”, “shop”, “cemetery”, “forest”, “park”, “heritage”, “sports”, “government”, “post office”, “parking”, “fuel”, “marina”, “hotel”, “industrial”. We collected the spatial footprints and landuse labels of the selected OSM polygons. Labels were processed for consistency and disambiguation [34]. Two datasets are created, the first containing $5941$ urban-objects from the region of Île-de-France. A subset of this data is depicted in Figure 6. The second dataset contains $1835$ urban-objects from the city of Nantes. Both datasets contain the same landuse classes, with the exception of the class “marina”, which was omitted for Nantes due to the lack of available urban-objects (only one urban-object was retrieved from OSM for Nantes).
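The grouping of raw OSM categories into classes can be pictured with a small lookup table. The sketch below is illustrative: it contains only the four merges and the two discarded categories named in the text, and the tag strings, the names `TAG_TO_CLASS`, `RARE_TAGS` and `to_landuse_class` are hypothetical rather than the actual mapping used for the dataset.

```python
# The 16 landuse classes listed in the text.
CLASSES = [
    "educational", "hospital", "religious", "shop", "cemetery", "forest",
    "park", "heritage", "sports", "government", "post office", "parking",
    "fuel", "marina", "hotel", "industrial",
]

# Hypothetical raw-category -> class table (only the examples from the text).
TAG_TO_CLASS = {
    "lycée": "educational",
    "école": "educational",
    "synagogue": "religious",
    "church": "religious",
}

# Rare categories that are dropped instead of being assigned a class.
RARE_TAGS = {"crematorium", "observatory"}

def to_landuse_class(tag):
    """Map a raw OSM category to one of the 16 classes, or None if unused."""
    if tag in RARE_TAGS:
        return None
    return TAG_TO_CLASS.get(tag)
```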

3.2 Ground-based Pictures and Overhead Imagery

To obtain the ground-based pictures corresponding to each urban-object, we used the Google Street View API. We downloaded a set of pictures from various viewpoints (Figure 7) in the following way: to collect the images oriented towards the urban-object, we selected the roads nearest to that urban-object and downloaded pictures (of size $640\times 640$ pixels) looking at the façade of the urban-object from different viewpoints and at a maximum distance of 12 meters from the object itself. Additionally, pictures located within the urban-object (which are often uploaded by users) were also retrieved using the same API. In this last situation, and when applicable, we downloaded pictures for inside locations in the four cardinal directions. For the $5941$ urban-objects present in the OSM footprints dataset of Île-de-France, we downloaded a total of $44957$ GSV pictures, while for the $1835$ urban-objects corresponding to Nantes we downloaded $9908$ GSV pictures.
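As an illustration of the four-direction download for inside locations, the sketch below builds request URLs for the Street View Static API using its documented `size`, `location`, `heading` and `key` parameters. The exact parameters used by the authors are not given in the text, so the helper `cardinal_view_urls`, the example coordinates and the placeholder key are assumptions; the code only builds URLs and performs no network request.

```python
from urllib.parse import urlencode

# Street View Static API endpoint.
GSV_ENDPOINT = "https://maps.googleapis.com/maps/api/streetview"

def cardinal_view_urls(lat, lng, api_key, size="640x640"):
    """Build one request URL per cardinal direction for a single location."""
    urls = []
    for heading in (0, 90, 180, 270):  # north, east, south, west
        params = {
            "size": size,                    # picture size in pixels
            "location": f"{lat},{lng}",      # camera position
            "heading": heading,              # compass direction of the camera
            "key": api_key,                  # placeholder, supply a real key
        }
        urls.append(f"{GSV_ENDPOINT}?{urlencode(params)}")
    return urls

# Example with made-up coordinates in Paris and a placeholder key.
example_urls = cardinal_view_urls(48.8566, 2.3522, "YOUR_API_KEY")
```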

Regarding the aerial images, we used the Google Maps Static API to obtain the top-view image of each urban-object, ensuring that the downloaded imagery covered the entire footprint. The original downloaded images have a size of $1280\times 1280$ pixels, with the ground pixel resolution depending on the width of the urban-object footprint. We downsampled the overhead images to $240\times 240$ pixels to be used in the CNN model. The number of overhead images corresponds to the number of footprints, i.e. $5941$ for Île-de-France and $1835$ for Nantes.

4 Experimental Setup

4.1 Joint CNN Training

To extract features from each image, we used the VGG16 model [40], both for the multimodal CNN and the baselines. For all models, the hyperparameters were kept fixed and the models were trained end-to-end with the following settings: the number of urban-objects processed in each training iteration was $4$ , while the initial learning rate was set to $0.001$ . Further, the learning rate was divided by a factor of 10 after every 10 epochs. The training was pursued for 50 epochs with Stochastic Gradient Descent (SGD) with momentum [49] as an optimizer. For data augmentation, we used the following strategies:

1. We resized the GSV pictures to $256\times 256$ pixels, followed by random crops of size $224\times 224$ pixels. The cropped image underwent random horizontal flipping and was normalized using the mean and the standard deviation values from the ImageNet dataset.

2. The overhead images were downscaled to $240\times 240$ pixels and randomly flipped in both vertical and horizontal directions, to strengthen invariance in the model.
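The training schedule and the two augmentation strategies can be sketched in a few lines of NumPy. This is not the authors' PyTorch code: the function names are hypothetical, epochs are assumed to be counted from zero, images are assumed to be float arrays in $[0,1]$ that have already been resized, and the ImageNet statistics are the commonly used per-channel values.

```python
import numpy as np

# Commonly used ImageNet per-channel statistics (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def step_lr(epoch, base_lr=0.001, step=10, gamma=0.1):
    """Learning rate divided by 10 after every 10 epochs (epoch starts at 0)."""
    return base_lr * gamma ** (epoch // step)

def augment_gsv(img, rng, crop=224):
    """Random crop, random horizontal flip and ImageNet normalization.

    img : (256, 256, 3) float array in [0, 1], already resized.
    """
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        out = out[:, ::-1]                 # horizontal flip
    return (out - IMAGENET_MEAN) / IMAGENET_STD

def augment_overhead(img, rng):
    """Random horizontal and vertical flips of a (240, 240, 3) overhead image."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                 # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1]                    # vertical flip
    return img
```

Under these assumptions, the 50 training epochs use five learning-rate levels, from $10^{-3}$ in epochs 0 to 9 down to $10^{-7}$ in epochs 40 to 49.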

The dataset was split into five different train and test sets. For each split, we randomly selected 80% of the urban-objects per landuse class for training and set aside the remaining 20% for testing. Note that the train and test sets are mutually exclusive. We calculate the overall accuracy (OA) and the average of the per-class accuracies (AA) over the test set in each split. The OA and AA values averaged over the 5 splits are presented for each model in Table 1. All the experiments were run on a server running Linux and featuring a GeForce GTX 1080 Ti GPU. We used the PyTorch library (http://pytorch.org/) for the computations. Training the multimodal model for 50 epochs took between 15 and 16 hours, while the Siamese model took between 11 and 12 hours and the overhead model was trained in 3 to 4 hours.
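The two reported scores differ in how they weight classes: OA counts every urban-object equally, whereas AA averages the per-class accuracies so that rare classes weigh as much as frequent ones. A minimal sketch, with a hypothetical function name and toy labels:

```python
from collections import defaultdict

def overall_and_average_accuracy(y_true, y_pred):
    """Return (OA, AA) for two equally long sequences of class labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    oa = sum(correct.values()) / len(y_true)                  # all objects pooled
    aa = sum(correct[c] / total[c] for c in total) / len(total)  # mean over classes
    return oa, aa

# Toy example: three "park" objects predicted correctly, one "fuel" object missed.
oa, aa = overall_and_average_accuracy(
    ["park", "park", "park", "fuel"],
    ["park", "park", "park", "shop"],
)
```

In the toy example OA is 0.75 while AA is 0.5, which shows why AA is the more demanding score on an imbalanced dataset.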

4.2 Missing modality retrieval

After studying the ability of the system to predict landuse, we examined the possibility of using the CCA-based retrieval algorithm presented in Section 2.4 to process urban-objects for which GSV data are not available. As detailed in the methodology section, we used the training data to define the embedding space. The features were extracted by using the VIS-CNN model (Section 2.2) for GSV pictures and VGG16 for the overhead imagery (Section 2.1). The feature vectors were normalized by dividing each one by its $L_{2}$ norm. The CCA system has three hyperparameters, which we fixed empirically:

- $\%pca$ is the percentage of the total feature dimension kept after applying PCA. The resulting dimensions of the data matrices $X_{1}$ and $X_{2}$ are $N\times d_{1}$ and $N\times d_{2}$ respectively, where $d_{1}=d_{2}=410$ (10% of $4096$). For the label matrix $X_{3}$, we keep $d_{3}=16$. The final dimension of the eigenvalue decomposition (Equation (5)) decreases from $8208$ to $836$.

- $\%d_{emb}=\frac{d_{emb}}{d_{1}+d_{2}+d_{3}}$ is the percentage of eigenvectors kept to compute the projection matrices and corresponds to the final dimension of the embedding space. It was chosen empirically as $\%d_{emb}=0.2$.

- $p$ is the power of the eigenvalue matrices $D_{i}$ in Equation (6). It was chosen empirically as $p=6$.

We also present a study of the sensitivity to these free parameters in Section 5.3.3.
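As a worked check of the dimensions quoted for the three hyperparameters, the arithmetic can be written out as follows. The rounding of $d_{emb}$ to an integer is an assumption, since the text only gives the ratio $\%d_{emb}=0.2$.

$$
\begin{aligned}
d_{1}=d_{2} &= 0.1\times 4096 = 409.6 \approx 410,\\
d_{1}+d_{2}+d_{3} &= 410+410+16 = 836 \quad (\text{down from } 4096+4096+16 = 8208),\\
d_{emb} &= 0.2\times 836 = 167.2 \approx 167 .
\end{aligned}
$$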

5 Results and Discussion

5.1 Joint CNN Training

The class accuracies are shown in Figure 8; averaged OA and AA values are given in Table 1. By comparing our multimodal model against the unimodal variants, we observe an increase of around 6% for OA and more than 7% for AA against the VGG16-based model trained on overhead imagery, while a sharp increase of more than 10% for both OA and AA is observed when comparing with VIS-CNN trained on GSV pictures. Additionally, we evaluated our proposed Multimodal CNN and VIS-CNN using different base CNN models, specifically AlexNet [49], which was used in [50] to perform landuse mapping with multispectral remote sensing images, and ResNet50 [51], which was used in [52] for large-scale land cover classification of satellite imagery. The results of these methods are presented in Table 2. Similar gains in performance are observed for the Multimodal CNN with respect to the unimodal models.

Looking at the per-class predictions (Figure 8), we can observe that our proposed multimodal model outperforms the baselines for almost all of the classes. Landuse classes like educational, hospital, post-office and fuel benefit from a jump of more than 9%, while classes like religious and hotel see an increase of more than 4% in their accuracies.

Some of the correct predictions of our model can be seen in Figure 9. For each example, we discuss briefly the complementary visual cues that are used by the multimodal model to predict the landuse category. For the class educational (with accuracy 77%), objects like playgrounds within the school campus are visible in overhead imagery. This information complements that brought by the side-views, including flags, a big entrance, the presence of metal fencing and broad pedestrian walks, or the presence of children (first row, Figure 9). If we analyze the overhead images of religious places (accuracy 78%), we notice stylized roofs with an absence of pipes, chimneys, exhausts, and the like. This adds complementary information to the big arched doors, rose windows and stained glass coming from the ground pictures (Figure 9, second row). The third row in Figure 9 shows the overhead imagery and set of GSV pictures for a correctly predicted sample for class cemetery, which has a very high accuracy (92%). We can observe several visual cues in the overhead imagery, like the specific grid pattern of the grave stones, separated by wide alleys. This is complemented by the ground views, which contain visible long continuous walls typical for cemeteries. Finally, in the case of the post office (accuracy 61%), the overhead imagery shows yellow delivery vans in the nearby parking area. This adds to characteristic visual objects that are usually present in the ground pictures, like the yellow “la-poste” signboard (seventh row, Figure 9).

Classes like government and shop, despite having training sets of 400 and 267 objects respectively, have comparatively lower accuracy scores (see Figure 8) for all the models. In the case of the multimodal model, the accuracies are still around 48% and 57%, respectively. Surprisingly, for the class fuel, though the number of training samples is only 122, its accuracy score is much higher (84.5%). We attribute the good result for the class fuel to the distinctive visual information from both ground and top views (see sixth row, Figure 9), which allows the CNN to perform well, even in the absence of a large dataset. On the contrary, classes like heritage sites and sports show a very small decrease in their accuracy scores compared to the VIS-CNN (for GSV pictures) and VGG16 (overhead imagery), respectively (see Figure 8). In the case of heritage sites, the overhead imagery does not carry discriminative information from the top view (as evident through the poor accuracy of 15.8% for the overhead model), which degrades the quality of the multimodal result as well.

Some misclassifications are shown in Figure 10. For example, the model predicts class educational for the “government” urban-object in the first row (Figure 10). This most probably emerges from the presence of information similar to that of an educational place in the ground views, such as the presence of objects like open spaces and benches in front or metallic fences enclosing the building. The second row of Figure 10 shows a parking area that has been predicted as a park, most likely due to the many trees visible in both the top and the ground views. In the third row of Figure 10, the urban-object with class religious was predicted as an industrial facility, possibly due to the large parking area with cars as seen in both the top and the side views, while the church far in the distance is vaguely visible. Wrong label predictions are sometimes observed because of the low quality of the downloaded GSV pictures. We found two issues about the downloading of GSV pictures for OSM polygons: i) in some cases the OSM polygons do not match with the actual boundaries of the urban-objects and ii) the distance-based heuristic used to download GSV pictures is sometimes inaccurate and leads to the download of pictures of other nearby urban-objects. These issues are also discussed in [34].

In order to show in more detail the accuracy of the model for each class, in Figure 11 we present the confusion matrix generated by averaging the test accuracy of the Multimodal CNN method (with VGG16 as base CNN model) for the Île-de-France dataset. We can see that classes like “Hospital”, “Heritage”, and “Post-Office” are often wrongly predicted as class “Government”. We can also observe that the urban-objects of “Forest” are sometimes classified as “Park” and urban-objects of “Shop” are occasionally misclassified as “Hotel”.

5.2 Generalisability of the model in a new city

We have used the data from the city of Nantes to evaluate the generalisation ability of our model. In Table 3 we present the OA and AA scores of the proposed Multimodal CNN model and the two unimodal models, trained with Île-de-France data. Overall, the model provides results in the ballpark of those observed for Île-de-France. AA scores are generally lower, mostly because the “Marina” class omitted for this dataset was very accurate in the Île-de-France case (average of 86% producer’s accuracy, see Figure 11). Comparing the methods in the Nantes case, the proposed Multimodal CNN is 5% more accurate in OA and 10% in AA with respect to the model that uses only overhead imagery. It also improves the accuracy of VIS-CNN by more than 16% in OA and 11% in AA, once again confirming the observations made in the first dataset. Note that we ran inference on the Nantes urban-objects directly, without any further fine-tuning of the models.

5.3 Missing modality retrieval

In this section, we test the ability of our model to predict landuse when the GSV pictures are missing. To do so, we use the CCA-based system presented in Section 2.4.

5.3.1 Numerical performance

The overall results are reported in the last row of Table 1, which shows the accuracy obtained by retrieving the missing GSV pictures for urban-objects that only have overhead imagery and then performing the label prediction using the proposed multimodal model. We can observe that the accuracy obtained by this method is higher by more than 4% in OA compared to the model that just uses overhead imagery (Section 2.1).

Figure 12 shows examples of retrieved GSV pictures (corresponding to urban-objects with the highest similarity scores) for five different overhead images. The first three rows show positive examples, with retrieved GSV pictures belonging to the same class as the queried overhead imagery. In these three examples, the retrieved ground-based pictures have discriminative visual features that can help to predict the correct labels when using the multimodal model, even though they come from another urban-object. The fourth and fifth rows present negative retrieval examples, where the retrieved GSV pictures belong to a different class compared to the queried overhead imagery. Note that the overhead image in the fourth row belongs to class “sports” as it contains a tennis court. However, since it is occluded by trees, the closest GSV pictures that were retrieved belonged to the class “forest”.

Figure 13 shows the classification results per class in terms of producer’s accuracy for one run of the algorithm. One can appreciate the accuracy of the direct retrieval of the nearest neighbors’ labels (blue bars), which is around $70\%$ for seven out of the sixteen classes. Poor results are obtained for classes “Hospital”, “Heritage” and “Post office”. These classes correspond to those with fewer examples in the training set. Using the GSV pictures of the retrieved training objects together with the true overhead images in the multimodal model (orange bars, corresponding to our proposition) strongly improves the results and almost closes the gap with the full multimodal model (green bars). The latter is an upper bound on performance, since it uses the real GSV pictures. The classes for which the accuracy of the full model is not matched correspond to those with a low number of samples, which already had a poor retrieval accuracy in the embedding space.

5.3.2 Label coherence in the embedding space

To follow up on this last observation, we analyze the label coherence in the embedding space, i.e. we want to verify that the urban-objects without GSV pictures are projected close to other urban-objects of the correct class. The blue curve in Figure 14 illustrates the trend for an increasing number of nearest neighbors (i.e. a top-$k$ accuracy). After projection, the test urban-object is mapped close to a sample of the correct class $62\%$ of the time, but this percentage increases when considering more neighbors in the embedding space (up to $69\%$ of the test samples are mapped close to at least one training sample of the correct class): this shows that the CCA space is coherent in terms of labels and that the retrieval can be successful. However, such an increase in top-$k$ accuracy has surprisingly little influence on the performance of the final multimodal model (red solid curve in Figure 14): even when using GSV pictures of the four nearest neighbors in the CCA space, the increase in performance is of $1\%$ only. We believe this modest increase in performance is due to the fact that, even though at least one retrieved training urban-object is of the right class, at most $k-1$ others will be of an incorrect class, which might confuse the GSV stream and impede larger improvements. To support this hypothesis, we evaluated the average number of nearest neighbors of the correct class: $0.65$ for $k=1$, $1.22$ for $k=2$, $1.85$ for $k=3$ and $2.46$ for $k=4$. Therefore, for smaller values of $k$, the GSV stream will receive pictures from objects of the right class approximately 60% of the time, which allows it to provide a robust response leveraging the discriminative information in the overhead view.
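The two quantities discussed above, the top-$k$ accuracy and the average number of retrieved neighbors of the correct class, can be computed from a similarity matrix as in the following sketch. The function name and the toy data are hypothetical, and the similarity matrix is assumed to be oriented so that larger values mean closer objects.

```python
import numpy as np

def topk_label_stats(sim, train_labels, test_labels, k):
    """Label coherence of the k nearest training objects of each test object.

    sim : (n_test, n_train) similarity matrix, higher means closer.
    Returns (top-k accuracy, mean number of neighbors with the correct class).
    """
    nearest = np.argsort(-sim, axis=1)[:, :k]               # k closest per test object
    hits = train_labels[nearest] == test_labels[:, None]    # (n_test, k) booleans
    topk_accuracy = hits.any(axis=1).mean()                 # at least one correct neighbor
    mean_correct = hits.sum(axis=1).mean()                  # average count of correct ones
    return topk_accuracy, mean_correct

# Toy example with two test objects and three training objects.
toy_sim = np.array([[0.9, 0.1, 0.5],
                    [0.2, 0.8, 0.7]])
toy_train = np.array([0, 1, 1])
toy_test = np.array([1, 1])
```

The top-$k$ accuracy can only grow with $k$, while the fraction of correct neighbors per query need not, which is the effect the paragraph above uses to explain the modest gain of the multimodal model.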

5.3.3 Sensitivity to the parameters of the CCA model

Finally, we provide an analysis of the sensitivity of the CCA model to its free parameters. For the results in the fourth row of Table 1, we empirically selected the parameter values of the proposed method: $\%pca=0.1$, $\%d_{emb}=0.2$ and $p=6$. Figure 15 shows the overall retrieval accuracy when fixing two of the three parameter values and varying the value of the third. These accuracies were computed by projecting the overhead imagery features of the test set into the embedding space and using the label of the nearest urban-object of the training set for prediction. We observed that the proposed system behaves in a stable manner when varying the hyperparameters.
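The protocol of fixing two parameters and varying the third can be sketched as a one-at-a-time sweep. The defaults below are the values reported in the text; the function `one_at_a_time` and the `evaluate` callback, which stands in for "fit the CCA model and return the retrieval accuracy", are hypothetical, and the candidate values in any grid passed to it are up to the user.

```python
# Parameter values reported in the text.
DEFAULTS = {"pca": 0.1, "d_emb": 0.2, "p": 6}

def one_at_a_time(grid, evaluate, defaults=DEFAULTS):
    """Vary each parameter over its candidate values, keeping the others fixed.

    grid     : dict mapping a parameter name to a list of candidate values.
    evaluate : callable taking pca, d_emb and p as keyword arguments and
               returning a score (hypothetical stand-in for the CCA pipeline).
    Returns a dict mapping each parameter name to a list of (value, score).
    """
    results = {}
    for name, values in grid.items():
        results[name] = []
        for value in values:
            params = dict(defaults, **{name: value})   # override one parameter
            results[name].append((value, evaluate(**params)))
    return results
```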

6 Conclusions and Outlook

In this work, we presented a multimodal model for landuse classification that uses pictures from top and ground views with annotations from OpenStreetMap. The proposed model learns end-to-end both the feature extraction from single modalities and their fusion. We evaluated our proposed method in the region of Île-de-France, France and found that, for many classes, the complementary visual information contained in either modality improved the accuracy of the model by a large margin. Our proposed multimodal CNN model can also predict landuse labels when ground-based pictures are not available for an urban-object by searching for the most plausible set of GSV pictures in the training set.

Using widely available data repositories for images (Google Street View and Google Maps) and public participatory vector annotations (OpenStreetMap) gives an edge to our model, as it is scalable to several other cities. The accuracies could be further improved by having a better quality dataset. This could be achieved by sourcing better quality labels (e.g., labels from other sources like Google Places) and/or refining heuristics for downloading the GSV pictures (e.g., collecting pictures that are looking at the urban-objects’ facade more accurately). For future work, we plan to explore the image information available at multiple scales as an input for our proposed model, as well as integrating fine-grained object detection in the ground images (e.g. objects like ambulances) as extra information cues.

Acknowledgment

The authors would like to thank Google and OSM for the access to pictures and objects’ footprints respectively through their APIs. This work has been supported by the Swiss National Science Foundation (grant PZ00P2-136827; SS, DT; http://p3.snf.ch/project-136827). JEVM acknowledges FAPESP (grants 2016/14760-5 and 2017/10086-0) for support.

References

[1]

C. Homer, J. Dewitz, L. Yang, S. Jin, P. Danielson, G. Xian, J. Coulston, N. Herold, J. Wickham, K. Megown, Completion of the 2011 national land cover database for the conterminous united states–representing a decade of land cover change information, Photogrammetric Engineering & Remote Sensing 81 (5) (2015) 345–354.

[2]

T. Postadjian, A. Le Bris, H. Sahbi, C. Mallet, Investigating the potential of deep neural networks for large-scale classification of very high resolution satellite images, ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 4 (2017) 183–190.

[3]

J. Inglada, A. Vincent, M. Arias, B. Tardy, D. Morin, I. Rodes, Operational high resolution land cover map production at the country scale using satellite image time series, Remote Sensing 9 (1) (2017) 95.

[4]

H. Taubenböck, T. Esch, M. Wiesner, A. Roth, S. Dech, Monitoring urbanization in mega cities from space, Remote Sensing of Environment 117 (2012) 162–176.

[5]

I. Demir, K. Koperski, D. Lindenbaum, G. Pang, J. Huang, S. Basu, F. Hughes, D. Tuia, R. Raskar, Deepglobe 2018: A challenge to parse the earth through satellite images, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 172–17209.

[6]

N. Riggan Jr, R. C. Weih Jr, Comparison of pixel-based versus object-based land use/land cover classification methodologies, Journal of the Arkansas Academy of Science 63 (1) (2009) 145–152.

[7]

S. W. Myint, A robust texture analysis and classification approach for urban land-use and land-cover feature discrimination, Geocarto International 16 (4) (2001) 29–40.

[8]

T. Blaschke, G. J. Hay, M. Kelly, S. Lang, P. Hofmann, E. Addink, R. Q. Feitosa, F. van der Meer, H. van der Werff, F. van Coillie, D. Tiede, Geographic object-based image analysis - towards a new paradigm, ISPRS Journal of Photogrammetry and Remote Sensing 87 (2014) 180 – 191.

[9]

L. Ma, M. Li, X. Ma, L. Cheng, P. Du, Y. Liu, A review of supervised object-based land-cover image classification, ISPRS Journal of Photogrammetry and Remote Sensing 130 (2017) 277 – 293.

[10]

X. Zhu, D. Tuia, L. Mou, G. Xia, L. Zhang, F. Xu, F. Fraundorfer, Deep learning in remote sensing: A comprehensive review and list of resources, IEEE Geoscience and Remote Sensing Magazine 5 (4) (2017) 8–36.

[11]

M. Campos-Taberner, A. Romero-Soriano, C. Gatta, G. Camps-Valls, A. Lagrange, B. L. Saux, A. Beaupère, A. Boulch, A. Chan-Hon-Tong, S. Herbin, H. Randrianarivo, M. Ferecatu, M. Shimoni, G. Moser, D. Tuia, Processing of extremely high resolution LiDAR and RGB data: Outcome of the 2015 IEEE GRSS Data Fusion Contest. Part A: 2D contest, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9 (12) (2016) 5547–5559.

[12]

A. Sharma, X. Liu, X. Yang, D. Shi, A patch-based convolutional neural network for remote sensing image classification, Neural Networks 95 (2017) 19 – 28.

[13]

D. Tuia, M. Volpi, G. Moser, Decision fusion with multiple spatial supports by conditional random fields, IEEE Transactions on Geoscience and Remote Sensing 56 (6) (2018) 3277–3289.

[14]

M. Volpi, D. Tuia, Dense semantic labeling of subdecimeter resolution images with convolutional neural networks, IEEE Transactions on Geoscience and Remote Sensing 55 (2) (2017) 881–893.

[15]

N. Audebert, B. Le Saux, S. Lefèvre, Semantic segmentation of earth observation data using multimodal and multi-scale deep networks, in: Asian Conference on Computer Vision, Springer, 2016, pp. 180–196.

[16]

M. Castelluccio, G. Poggi, C. Sansone, L. Verdoliva, Land use classification in remote sensing images by convolutional neural networks, arXiv preprint arXiv:1508.00092 (2015).

[17]

F. Hu, G. Xia, J. Hu, L. Zhang, Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery, Remote Sensing 7 (11) (2015) 14680–14707.

[18]

T. Hermosilla, L. Ruiz, J. Recio, M. Cambra-Lopez, Assessing contextual descriptive features for plot-based classification of urban areas, Landscape and Urban Planning 106 (1) (2012) 124–137.

[19]

M. Voltersen, C. Berger, S. Hese, C. Schmullius, Object-based land cover mapping and comprehensive feature calculation for an automated derivation of urban structure types at block level, Remote Sensing of Environment 154 (2014) 192–201.

[20]

B. Bechtel, P. J. Alexander, J. Böhner, J. Ching, O. Conrad, J. Feddema, G. Mills, L. See, I. Stewart, Mapping local climate zones for a worldwide database of the form and function of cities, ISPRS International Journal of Geo-Information 4 (1) (2015) 199–219.

[21]

C. Zhang, I. Sargent, X. Pan, H. Li, A. Gardiner, J. Hare, P. M. Atkinson, An object-based convolutional neural network (OCNN) for urban land use classification, Remote Sensing of Environment 216 (2018) 57 – 70.

[22]

F. Pacifici, M. Chini, W. J. Emery, A neural network approach using multi-scale textural metrics from very high-resolution panchromatic imagery for urban land-use classification, Remote Sensing of Environment 113 (6) (2009) 1276–1292.

[23]

D. Tuia, R. Flamary, N. Courty, Multiclass feature learning for hyperspectral image classification: Sparse and hierarchical solutions, ISPRS Journal of Photogrammetry and Remote Sensing 105 (2015) 272–285.

[24]

M. Volpi, D. Tuia, Deep multi-task learning for a geographically-regularized semantic segmentation of aerial images, ISPRS Journal of Photogrammetry and Remote Sensing 144 (2018) 48–60.

[25]

D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu, U. Stilla, Classification with an edge: Improving semantic image segmentation with boundary detection, ISPRS Journal of Photogrammetry and Remote Sensing 135 (2018) 158–172.

[26]

N. Yokoya, P. Ghamisi, J. Xia, S. Sukhanov, R. Heremans, C. Debes, B. Bechtel, B. L. Saux, G. Moser, D. Tuia, Open data for global multimodal land use classification: Outcome of the 2017 IEEE GRSS Data Fusion Contest, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (5) (2018) 1363–1377.

[27]

D. Leung, S. Newsam, Exploring geotagged images for land-use classification, in: Proceedings of the ACM multimedia 2012 workshop on Geotagging and its applications in multimedia, 2012, pp. 3–8.

[28]

Y. Zhu, S. Newsam, Land use classification using convolutional neural networks applied to ground-level images, in: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2015, pp. 61:1–61:4.

[29]

L. Tracewski, L. Bastin, C. C. Fonte, Repurposing a deep learning network to filter and classify volunteered photographs for land cover and land use characterization, Geo-spatial Information Science 20 (3) (2017) 252–268.

[30]

Y. Zhu, X. Deng, S. Newsam, Fine-grained land use classification at the city scale using ground-level images, arXiv preprint arXiv:1802.02668 (2018).

[31]

J. Wegner, S. Branson, D. Hall, K. Schindler, P. Perona, Cataloging public objects using aerial and street-level images-urban trees, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 6014–6023.

[32]

N. Naik, S.-D. Kominers, R. Raskar, E.-L. Glaeser, C.-A. Hidalgo, Computer vision uncovers predictors of physical urban change, Proceedings of the National Academy of Sciences of the United States of America 114 (29) (2017) 7571–7576.

[33]

S. Lefèvre, D. Tuia, J. D. Wegner, T. Produit, A. S. Nassar, Towards seamless multi-view scene analysis from satellite to street-level, Proceedings of the IEEE 105 (10) (2017) 1884–1899.

[34]

S. Srivastava, J. E. V. Muñoz, S. Lobry, D. Tuia, Fine grained landuse characterization using ground-based pictures: an open data, deep learning-based solution, International Journal of Geographical Information Science 0 (0) (2018) 1–20.

[35]

Y. Movshovitz-Attias, Q. Yu, M. C. Stumpe, V. Shet, S. Arnoud, L. Yatziv, Ontological supervision for fine grained classification of street view storefronts, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1693–1702.

[36]

J. Kang, M. Körner, Y. Wang, H. Taubenböck, X. X. Zhu, Building instance classification using street view images, ISPRS Journal of Photogrammetry and Remote Sensing (2018) 44–59.

[37]

S. Workman, M. Zhai, D.-J. Crandall, N. Jacobs, A unified model for near and remote sensing, in: Proceedings of the IEEE International Conference on Computer Vision, 2017.

[38]

O. A. B. Penatti, K. Nogueira, J. A. dos Santos, Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Earthvision, 2015, pp. 44–51.

[39]

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.

[40]

K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv preprint arXiv:1409.1556 (2014).

[41]

M. Aittala, F. Durand, Burst image deblurring using permutation invariant convolutional neural networks, in: European Conference on Computer Vision, 2018.

[42]

D. Tuia, M. Volpi, M. Trolliet, G. Camps-Valls, Semisupervised manifold alignment of multimodal remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 52 (12) (2014) 7708–7720.

[43]

D. Tuia, D. Marcos, G. Camps-Valls, Multi-temporal and multi-source remote sensing image classification by nonlinear relative normalization, ISPRS Journal of Photogrammetry and Remote Sensing 120 (2016) 1–12.

[44]

A. A. Nielsen, K. Conradsen, J. J. Simpson, Multivariate alteration detection (MAD) and MAF postprocessing in multispectral, bitemporal image data: New approaches to change detection studies, Remote Sensing of Environment 64 (1) (1998) 1–19.

[45]

M. Volpi, G. Camps-Valls, D. Tuia, Spectral alignment of cross-sensor images with automated kernel canonical correlation analysis, ISPRS Journal of Photogrammetry and Remote Sensing 107 (2015) 50–63.

[46]

J. A. Lee, M. Verleysen, Nonlinear dimensionality reduction, Springer, 2007.

[47]

Y. Gong, Q. Ke, M. Isard, S. Lazebnik, A multi-view embedding space for modeling internet images, tags, and their semantics, International Journal of Computer Vision 106 (2) (2014) 210–233.

[48]

O. Chapelle, J. Weston, B. Schölkopf, Cluster kernels for semi-supervised learning, in: Advances in Neural Information Processing Systems, 2003, pp. 601–608.

[49]

A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[50]

B. Huang, B. Zhao, Y. Song, Urban land-use mapping using a deep convolutional neural network with high spatial resolution multispectral remote sensing imagery, Remote Sensing of Environment 214 (2018) 73 – 86.

[51]

K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[52]

X.-Y. Tong, Q. Lu, G.-S. Xia, L. Zhang, Large-scale land cover classification in Gaofen-2 satellite imagery, in: IEEE International Geoscience and Remote Sensing Symposium, 2018.
