Deep Built-Structure Counting in Satellite Imagery Using Attention Based Re-Weighting

Anza Shakeel; Waqas Sultani; Mohsen Ali

arXiv:1904.00674·cs.CV·April 2, 2019


TL;DR

This paper introduces a deep learning framework with attention-based re-weighting for accurately counting built-structures in satellite images, addressing challenges like shape variance and overlapping boundaries.

Contribution

It proposes a novel attention-based fusion network and a large-scale dataset for improved built-structure counting in diverse satellite imagery.

Findings

01

Achieved a Mean Absolute Error of 3.65 in building count prediction.

02

Demonstrated high correlation with an R-squared of 88%.

03

Validated on unseen regions with an error of 19 buildings out of 656.

Abstract

In this paper, we attempt to address the challenging problem of counting built-structures in satellite imagery. Building density is a more accurate estimate of population density, urban area expansion and its impact on the environment than built-up area segmentation. However, building shape variances, overlapping boundaries, and varying densities make this a complex task. To tackle this difficult problem, we propose a deep learning based regression technique for counting built-structures in satellite imagery. Our proposed framework intelligently combines features from different regions of the satellite image using attention based re-weighting techniques. Multiple parallel convolutional networks are designed to capture information at different granularities. These features are combined in the FusionNet, which is trained to weigh features from different granularities differently,…

Tables (3)

Table 1: Details of the collected dataset.

Landscape	Number of Images	Area Covered ($km^{2}$)
Urban Areas	2211	22.55
Hilly Areas	251	2.56
Desert	220	2.24
Total	2682	27.35

Table 2: Segmentation results of SS-Net on the Village Finders test set. The results demonstrate the high accuracy of the proposed technique.

Evaluation Metric	SS-Net Result
Pixel-wise accuracy	0.947
F1 score	0.8

Table 3: Total Absolute Error (TAE) of structure counts for the Low-Count, Medium-Count and High-Count sets; the numbers in brackets give the range of building counts per satellite image patch. In the test set, the Low-Count set contains 3880 structures, the Medium-Count set 3937 and the High-Count set 1128. The Mean Absolute Error (MAE) and $R^{2}$ score of each model are also listed.

Models	DRC	GWAP	CCPP	FusionNet
Total Absolute Error (Low-Count: 0 to 30)	1158	1121	1136	1001
Total Absolute Error (Medium-Count: 31 to 60)	814	820	796	743
Total Absolute Error (High-Count: > 60)	229	180	161	176
Total Absolute Error (TAE: Total)	2201	2121	2093	1920
Mean Absolute Error (TAE / Total Number of Images)	4.14	3.99	3.94	3.61
$R^{2}$ (coefficient of determination)	0.86	0.872	0.875	0.88
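The quantities reported in Table 3 can be illustrated with a minimal sketch in plain Python. It assumes that each image is assigned to a count range by its ground-truth count and that $R^{2}$ is the usual coefficient of determination; the function name `count_metrics` and its return layout are hypothetical and are not taken from the paper.

```python
def count_metrics(true_counts, predicted_counts):
    """Return (TAE per count range, total TAE, MAE, R^2) for per-image counts."""
    if len(true_counts) != len(predicted_counts) or not true_counts:
        raise ValueError("need two non-empty sequences of equal length")

    # Count ranges from the Table 3 caption: low 0-30, medium 31-60, high > 60.
    # Assumption: an image is binned by its ground-truth count.
    tae_by_range = {"low": 0.0, "medium": 0.0, "high": 0.0}
    for truth, pred in zip(true_counts, predicted_counts):
        if truth <= 30:
            key = "low"
        elif truth <= 60:
            key = "medium"
        else:
            key = "high"
        tae_by_range[key] += abs(truth - pred)

    # MAE is the total absolute error divided by the number of images.
    tae_total = sum(tae_by_range.values())
    mae = tae_total / len(true_counts)

    # Coefficient of determination: 1 - SS_res / SS_tot.
    mean_truth = sum(true_counts) / len(true_counts)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_counts, predicted_counts))
    ss_tot = sum((t - mean_truth) ** 2 for t in true_counts)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return tae_by_range, tae_total, mae, r2
```

For example, ground-truth counts `[10, 40, 80, 0]` with predictions `[12, 37, 85, 0]` give a total absolute error of 10 over four images, that is, an MAE of 2.5.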


Taxonomy

Topics: Remote-Sensing Image Classification · Video Surveillance and Tracking Methods · Remote Sensing and Land Use

Full text

Deep Built-Structure Counting In Satellite Imagery Using Attention Based Re-weighting

Anza Shakeel

Waqas Sultani

Mohsen Ali (corresponding author)

Information Technology University, Lahore, Pakistan

(mscs15043, waqas.sultani, mohsen.ali)@itu.edu.pk

Abstract

In this paper, we attempt to address the challenging problem of counting built-structures in satellite imagery. Building density is a more accurate estimate of population density, urban area expansion and its impact on the environment than built-up area segmentation. However, building shape variances, overlapping boundaries, and varying densities make this a complex task. To tackle this difficult problem, we propose a deep learning based regression technique for counting built-structures in satellite imagery. Our proposed framework intelligently combines features from different regions of the satellite image using attention based re-weighting techniques. Multiple parallel convolutional networks are designed to capture information at different granularities. These features are combined in the FusionNet, which is trained to weigh features from different granularities differently, allowing us to predict a precise building count. To train and evaluate the proposed method, we put forward a new large-scale and challenging built-structure-count dataset. Our dataset is constructed by collecting satellite imagery from diverse geographical areas (plains, urban centers, deserts, etc.) across the globe (Asia, Europe, North America, and Africa) and captures a wide range of built-structure densities. Detailed experimental results and analysis validate the proposed technique. FusionNet has a Mean Absolute Error of 3.65 and an R-squared measure of 88% over the testing data. Finally, we test on $274.3\times 10^{3}$ $m^{2}$ of an unseen region, with an error of 19 buildings out of the 656 buildings in that area. The dataset is available at http://im.itu.edu.pk/deepcount/.

keywords:

Land Use, Deep Learning, Regression, Attention Based Re-weighting, Building Count, Built-up Area Segmentation

1 Introduction

Accurate, detailed and up-to-date analysis of urban and non-urban areas plays a vital role in building an economic and social understanding of a region, helping in policy making and in designing interventions. This analysis depends on reliable and up-to-date surveys, which are lacking in the economically challenged areas of the world (Jean et al., 2016). Among the important but laborious-to-gather statistics are population density and built-up area, especially in either densely constructed or sparsely populated areas. An accurate and up-to-date mapping of built-up areas is necessary for effective disaster (e.g., flood or earthquake) relief, urban food security, and estimating the effects of urbanization on farmland, forest volume, and population. Recently, alongside a surge in using machine learning and satellite imagery to discover economic and social patterns such as poverty (Jean et al., 2016), slavery (Boyd et al., 2018), population spread, and large-scale urban patterns (Albert et al., 2017), there have been some successes in built-up-area estimation and building detection (Zhang et al., 2017). Unlike the aforementioned works, which rely on the collective features of the image to regress a value, building detection requires detailed visual analysis, more accurately labeled data and imagery of reasonable resolution. Over the years, accurate results for predicting land-use and land-cover maps have been presented (Zhou et al., 2018; Längkvist et al., 2015; Albert et al., 2017). However, these are either image-classification-based approaches or techniques restricted to segmenting out areas without producing a realistic count of structures. Furthermore, these approaches are not able to capture changes inside urban regions.

Counting allows fine-grained urban population analysis and a detailed view of the change occurring within urban and rural centers, without explicitly tackling the complex task of individual building segmentation. It is a better surrogate for population analysis, is more helpful in disaster management (damage and destruction estimation) and urban food security analysis, and allows complex economic analysis of different parts of a city (indirectly allowing us to understand how much land is being used by each building).

Several datasets have been introduced by different researchers for various remote sensing applications. Available satellite imagery datasets include (Yang and Newsam, 2013), (Zou et al., 2015), (Xia et al., 2017), (Rottensteiner et al., 2012), (Lam et al., 2018) and (Zhou et al., 2018), comprising land-use type class labels, bounding boxes for overhead object localization and segmentation masks for categories such as road, vegetation and buildings. Note that these land-use and land-cover datasets cover regions where buildings are separated from each other, or use hyper- or multispectral imagery. Although these datasets are challenging and useful, none of them addresses the important problem of building counting in satellite imagery, especially in congested regions.

We venture into estimating the density of buildings in visible-spectrum satellite imagery and present our counting results on a diverse set of images taken from sparsely to densely populated areas across the globe. We propose a deep regression based network and two new attention based re-weighting techniques to obtain building counts. To thoroughly evaluate our proposed approaches, we have collected a new large dataset of satellite imagery capturing built-structures of different densities (low, medium, and high), as well as scenes without any built structure. Furthermore, we provide detailed building-count annotations for each satellite image. To our knowledge, this is the first time the challenging task of building counting has been handled at this scale. Our work exploits recent developments in Deep Learning (LeCun et al., 2015) and proposes a Convolutional Neural Network (Krizhevsky et al., 2012a) based solution for estimating the number of buildings in a region. In summary, our work makes the following technical contributions.

1. We propose three new convolutional neural network based approaches for building counting. Firstly, we propose deep regression based counting. Secondly, we propose to employ an attention network and introduce two new attention based re-weighting techniques to count the number of buildings.

2. We propose a large, diverse satellite-imagery-based dataset with hand-counted numbers of buildings.

3. We perform extensive experiments to evaluate the different approaches. Experimental results demonstrate that our approach achieves state-of-the-art results compared to competitive baselines.

4. Since we automatically estimate the built regions through attention networks, we require only image-level count information, unlike previous methods (Li et al., 2016; Sam et al., 2017; Liu et al., 2018b), which require sub-image-level information.

In what follows, we first provide a detailed review of existing techniques (Sec. 2). Details of our dataset collection are shared in Sec. 3, and in Sec. 4 we present our proposed methodologies along with implementation details. Sec. 5 consists of results and their analysis. Finally, Sec. 6 concludes the paper.

2 Related Work

Identifying urban markers in satellite imagery has been explored extensively. Most of these works differ in the quality of imagery, the sensor, the resolution of the imagery and the granularity at which results are reported. Head et al. (2017) used high-resolution satellite imagery and night-time satellite imagery as input to train a CNN-based deep neural network for predicting economic markers representing human development. Gueguen and Hamid (2015) proposed trees-of-shapes features to perform damage detection using both pre- and post-event satellite imagery. LaLonde et al. (2018) performed object detection in wide-area motion imagery. Cheng et al. (2016, 2019) proposed a rotation-invariant CNN for object detection in VHR remote sensing images. Note that since we automatically estimate the density of built regions through attention networks, we require only image-level count information instead of expensive object-level annotations. Similarly, there is extensive literature on building detection, mainly relying on multi-spectral imaging. Built-up area detection and building detection systems vary in the information they use (visible spectrum, multi-spectral imaging, DEM (Digital Elevation Model), LiDAR), the features they extract (lines, corners, texture, etc.), the machine learning applied to these features, the resolution of the input and output, and the final objective they achieve. The Global Human Settlement Layer (Pesaresi et al., 2016) has been constructed using Landsat imagery from multiple years, giving the percentage of built-up coverage in each pixel (38.2 m spatial resolution). Deep-learning-based semantic segmentation (Long et al., 2015; Badrinarayanan et al., 2017) has been applied to satellite imagery for land-use and land-cover analysis (Audebert et al., 2016). Also, Yang et al. (2018) modified (Badrinarayanan et al., 2017) to design a two-stage CNN that first segments land-cover type; the segmented land-cover polygons are then further processed for land-use classification. A boundary-detector-based semantic segmentation model is trained by Marmanis et al. (2018); in this model, a Digital Elevation Model (DEM) is used with the input image to train the pipeline. To our knowledge, however, there is no previous work on counting buildings from satellite imagery, let alone using the RGB spectrum.

Initial works relied on detecting local features, such as edges, lines and corners (Huertas and Nevatia, 1988; Sirmacek and Unsalan, 2009), or photometric properties (Müller and Zaum, 2005; Ghaffarian and Ghaffarian, 2014), and then intelligently combining this information (Izadi and Saeedi, 2012; Krishnamachari and Chellappa, 1996). Izadi and Saeedi (2012) looked for intersections of lines and shadow cues to define the buildings. Müller and Zaum (2005) used low-level features to capture geometric (roundness, size), photometric and structural properties (shadow, and the presence of other houses, that is, the neighborhood). They assumed that roof-tops are more visible in the red channel, constraining themselves to one particular type of roof. This is not true in general, as seen in Fig. 2. Ghaffarian and Ghaffarian (2014) proposed a variation of the FastICA algorithm and k-means to detect buildings in monocular high-resolution Google Earth images using the LUV color space. Although their source of imagery is similar to ours, they rely on low-level, hand-crafted features, and their objective is only built-up area segmentation, not an estimate of the number of buildings.

Another direction is to group low-level features intelligently for detecting buildings. Krishnamachari and Chellappa (1996) used Markov Random Fields to detect buildings by combining the straight line segments detected from the edge map of an aerial image. In designing their objective function, they used the insight that only line segments near each other need to be combined and that such combinations should encourage rectangular shapes. Ok (2013) performed multiple graph-cuts using shadow cues to detect buildings in very-high-resolution multi-spectral images. Most previous works in building detection relied on shadow detection (Ok, 2013; Müller and Zaum, 2005; Huertas and Nevatia, 1988; Ngo et al., 2017; Irvin and McKeown, 1989) or shadow cues (Chen et al., 2014).

However, shadows depend on the position of the illumination source at the time of image capture. Several research works use more than one optical sensor (Ngo et al., 2017; Ok, 2013) and rely on multi-spectral and/or high-resolution satellite imagery. Such systems mostly end up segmenting the built-up area and fail in cases where buildings are connected or very close to each other.

Closest to our work is that of Xia and Wang (2018), where the authors segment out building instances, but they use high-resolution imagery and a difficult-to-obtain mobile LiDAR dataset. We solve the problem of counting the number of buildings from satellite or aerial imagery in RGB space. This is a difficult problem, especially for densely populated areas in general (Zhang et al., 2017), and more specifically where the buildings are connected. Architectural and cultural designs affect how buildings appear from above, making it difficult to identify the boundaries of each building separately.

Counting objects in images or videos (Idrees et al., 2013) is an interesting and important problem. However, most recent works have targeted crowd counting, perhaps because dataset preparation for it is easier or because the problem can be reduced to counting heads, whereas a building does not have any such distinctive sub-part. Counting objects can be a complex or an easy problem depending on the sample. If objects are separable, a simple method is to detect objects and count them; see, for instance, (Hsieh et al., 2017; Liu et al., 2018b; Sam et al., 2017). Furthermore, the recent success of deep learning based object detectors (Ren et al., 2017; Redmon et al., 2016; Hu and Ramanan, 2017) allows detection-based counting methods to be more accurate. Moreover, many of these works exploit the structure of the object, for example in crowd counting (Idrees et al., 2013). Hu and Ramanan (2017) and Attia and Dayan (2018) use head shapes, which are consistent across humans, to detect heads and use them for counting. They also use perspective information, i.e., the density of humans per pixel in patches far away from the camera will be higher than the density of humans in patches near the camera. Many research works use the fact that humans will be standing upright (in a way, aligned to the axis). The dataset collected by Marsde et al. (2018) also has different densities and perspective properties. The same is true for the car counting problem. In sharp contrast, perspective information is not useful for counting buildings in satellite images, especially in the case of irregular construction, where houses of all sizes are built next to each other.

DecideNet (Liu et al., 2018a) comes closest to our work in trying to find a middle path between counting by detection and counting by regression. However, that is where the similarity ends. Their algorithm relies on an object detection pipeline based on detecting heads. Furthermore, their regression pipeline requires that dots be placed on the heads of the humans. Neither of these conditions applies to our problem. Buildings have no such visible "head", and our pipeline does not require individual dots for each house. Our method relies on exploiting built-area segmentation (not to be confused with instance-level segmentation) for the attention. It requires only the total count for the image and not the sub-image count information used in previous counting methods.

Datasets play a vital role in the research, development, and evaluation of new technologies. Lu et al. (2018) proposed a remote sensing captioning dataset, where each image is accompanied by five sentences. Xia et al. (2017) proposed a new large-scale scene classification dataset which includes 30 scene classes such as beach, airport, desert, and farmland. Other scene classification datasets include UC-Merced (Yang and Newsam, 2010), NWPU-RESISC45 (Cheng et al., 2017), WHU-RS (Sheng et al., 2012) and RSSCN7 (Zou et al., 2015). Zhou et al. (2018) presented a 38-class dataset for remote sensing image retrieval applications. Finally, Tian et al. (2017) put forward a new dataset for cross-view image geo-localization. In contrast, we collect a new challenging dataset that captures built areas of various densities from the satellite view.

3 Data Collection

The proliferation of deep learning libraries has enabled many to train for classification and regression tasks on hyper- or multi-spectral images without explicitly hand-designing weighting and fusion functions for the different channels. However, the use of deep-learning-based methods in remote sensing has been held back by the absence of large-scale datasets (Demir et al., 2018). To the best of our knowledge, there is no publicly available dataset for counting buildings using satellite imagery that covers different geographical areas and a variety of built-up densities. Therefore, we have collected a new geographically diverse dataset by extracting, sorting and marking satellite images. Although we mainly focus on counting the number of buildings, our dataset can be used for other remote sensing applications as well, such as a more accurate surrogate for population density estimation or neighborhood type estimation. With the development of the latest high-quality hyper-spectral optical sensors, good-quality high-resolution satellite images are publicly available for several developed countries. However, the majority of publicly available satellite images are still of low quality, as shown in Fig. 2, in comparison to the ground image datasets available today. In this paper, we focus on RGB images of a resolution (m per pixel) which might be considered VHR with respect to satellite imagery but which is of low quality when we consider the sharpness of edges or noise, especially for the task at hand, that is, counting the number of buildings.

Note that the same building looks quite different depending on the position of the satellite and the time of day at which the image was taken. The height of the building, shadows, the degree of separation and the types of boundaries between buildings make these images challenging. To make our dataset realistic, we have collected satellite images at different times and from geographically different locations, depicting a diverse range of built-up areas. Below, we provide details about our dataset collection and annotation process.

3.1 Sorting and collecting data

The satellite images are collected from regions with geographical and architectural differences, covering natural, urban and desert landscapes (Fig. 2). Collecting images from different locations induces scene-type variability and makes the dataset challenging. We selected different regions from these geographically different locations. From those regions, we downloaded high-density, moderate-density and low-density areas using the Google Earth API. All these images are captured at zoom level 19, which corresponds to 0.3 m per pixel. A building in a densely populated urban residential area covers approximately $25$-$30$ $m^{2}$, while in hilly regions and other rural areas the covered area decreases to $15$-$20$ $m^{2}$. A tile size of $100\textit{m}\times 100\textit{m}$ is selected on Google Maps to capture all types of small, medium and large built-structures. The downloaded image size is $336\times 336$. Table 1 shows the number of images downloaded from different landscapes along with their areas in square kilometers.

Manually downloading geo-located images from Google Maps is a daunting task; therefore, a Matlab® based tool was designed that calls the Google Earth API to download geo-located images automatically. Specifically, for a given size and scale, the image array and its corresponding latitude and longitude vectors are saved. Note that this pixel-level geo-location is very useful for visualization and post-processing purposes.

The table in Fig. 3 shows how challenging the collected dataset is. The built-up region is computed using the satellite segmentation network explained in later sections. The percentage of area covered by the detected built-structures is compared with the labeled count of buildings, across varying structure sizes. As the satellite images are collected from various locations, the dataset covers a variety of architectural designs and building sizes that are difficult to learn from. As shown in the table in Fig. 3, there exist images that contain few buildings but in which the buildings cover nearly $50$-$80$% of the area.

3.2 Data tagging

The collected dataset was annotated thoroughly. Specifically, we designed a Matlab-based GUI to tag the ground-truth building count. To ensure good-quality annotation, each image was annotated by at least two annotators. In Fig. 3, we provide detailed statistics of our dataset. The pie chart on the left shows the percentage of data that belongs to a specific count window. As can be seen, our dataset contains images with a wide variety of house counts, from no built structure to a large number of built-structures. The table on the right of Fig. 3 shows the number of images in our data with a specific percentage of area covered by structures, relative to the number of buildings in them. Note that we obtain the built-up ratio using our Satellite Segmentation Net (Sec. 4.2.1).

4 Methodology

The primary goal of our paper is to achieve a precise count of the number of buildings in each satellite image. This is a challenging problem, as satellite imagery is usually of lower resolution and quality than generally available ground imagery. Most importantly, there is often no visible space between neighboring buildings, making it difficult to delineate each building accurately. Therefore, we propose to map deep visual features to real numbers representing the count of built-structures in the image. As input to our regression model, we initially took deep features from DenseNet (Huang et al., 2017a) and mapped them to the house counting problem through a fully connected neural network. Although we achieve decent counting results (see Table 3), DenseNet gives equal importance to all image regions, resulting in a loss of accuracy. Our key intuition is that, for house counting, features belonging to built-up regions are more important than features originating from the rest of the image, such as farm fields or streets. Therefore, we propose two deep regression approaches using attention based re-weighting (ABW), in which we decrease the influence of deep features from non-built areas such as fields or streets, thus enabling the algorithm to predict the count more accurately. Our experimental results validate our intuition. Below, we provide details for each of our proposed approaches.

4.1 Deep Regression Counting (DRC)

We pose built-structure counting as a deep regression problem, that is, training deep learning based models with regression as the output layer. Transfer learning (Bengio, 2012) is performed by extracting deep features from the global average pooling layer of DenseNet (Huang et al., 2017a), pre-trained on ImageNet. Note that ImageNet (Krizhevsky et al., 2012a) is a large dataset with 1K class labels. Many recent works, such as (Huang et al., 2017b), indicate that features learned by CNN models, e.g. VGG (Simonyan and Zisserman, 2014) and AlexNet (Krizhevsky et al., 2012b), trained on such large datasets can be used to perform transfer learning for tasks with limited training data. DenseNet was used because of its reported high accuracy and computational efficiency.

The extracted features are fed into a fully connected (FC) neural network. We used a three-layer network with 512, 32 and 1 units, respectively (Fig. 4). We used dropout with a rate of 60% between the FC layers. ReLU is used as the activation after the first and second fully connected layers. No activation function is applied at the output layer. We have not used the ImageNet mean values to normalize our remote sensing data. Although initializing weights from such datasets is more helpful than using random values, the ImageNet mean is representative of everyday ground-level images. In our experiments, normalizing satellite imagery using these values disrupts the input and reduces the accuracy of the model.
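The regression head described above can be sketched as a forward pass in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`init_head`, `regression_head`), the He-style weight initialization, the placement of dropout after each ReLU and the use of inverted dropout are all assumptions, and the input feature length is left as a parameter because the text does not state which DenseNet variant is used.

```python
import random

def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def dropout(x, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def init_layer(n_in, n_out, rng):
    scale = (2.0 / n_in) ** 0.5  # He-style scale; an illustrative assumption
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def init_head(n_features, rng, sizes=(512, 32, 1)):
    """Build the 512-32-1 stack of (weights, bias) pairs."""
    layers, n_in = [], n_features
    for n_out in sizes:
        layers.append(init_layer(n_in, n_out, rng))
        n_in = n_out
    return layers

def regression_head(features, layers, rng=None, drop_rate=0.6):
    """Map a pooled feature vector to one unbounded count estimate."""
    x = features
    for i, (weights, bias) in enumerate(layers):
        x = dense(x, weights, bias)
        if i < len(layers) - 1:       # ReLU after the first and second FC layers only
            x = relu(x)
            if rng is not None:       # dropout is applied only when an rng is passed (training)
                x = dropout(x, drop_rate, rng)
    return x[0]                       # linear output, no activation
```

Passing `rng=None` gives the deterministic inference-time prediction; passing a `random.Random` instance enables dropout as it would be used during training.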

4.2 Deep Regression Counting by Attention

Deep Regression Counting suffers from giving equal weight to all features, whether or not they belong to a built-up region. Attention-based architectures help a neural network concentrate on the task at hand and remain unaffected by noise. To exploit local information for a precise building count, we propose to use built-up region segmentation probabilities as the attention.

4.2.1 Satellite Segmentation Net (SS-Net):

We train a compact VGG-based (Simonyan and Zisserman, 2014) fully convolutional neural network (Long et al., 2015) to perform pixel-wise built-up region classification. We call this network SS-Net. The output convolutional layer of this network predicts whether a $64\times 64$ input patch belongs to a built-structure or not. SS-Net is trained on the low-resolution Village-Finders dataset (Murtaza et al., 2009). The original size of the images in this dataset is $512\times 512$. We randomly crop patches of size $64\times 64$ and $128\times 128$ from each image and use the associated segmentation mask to generate labels. The weights of the network are initialized with pre-trained VGG weights (trained on ImageNet), and the data is normalized using its own mean values instead of the ImageNet ones. During training, each patch in the training set is augmented four times by flipping, inverting and rotating the patch by 45 degrees. Inspired by (Dupret and Koda, 2001) and (Harwood et al., 2017), we address the problem of unbalanced data with a bootstrapped hard negative mining method. Specifically, after every 15 epochs, new samples were evaluated and all those with a false positive response were added to the training set as negative examples. Fig. 4 shows the network architecture of SS-Net.

During inference, when we present a $64\times 64$ patch to SS-Net, it returns the probability that this patch contains building(s) or part of a building. Since SS-Net is fully convolutional, it is capable of processing images of any size greater than $64\times 64$ pixels. The output size of the feature map at any layer is given by $n_{out}=\frac{n_{in}+2P-K}{S}+1,$ where $n_{out}$ is the output size of the feature map, $n_{in}$ is the size of the input feature map, $P$ is the padding, $K$ is the filter size and $S$ is the stride. In our experiments, we used $P=1$, $S=2$ for max pooling layers and $P=1$, $S=1$ for convolution layers, and the value of $K$ depends on the convolution layer. For instance, for an input image of $224\times 224$, after 3 max pooling layers of stride 2 and kernel size 2, followed by convolution layers of filter size $8\times 8$ and $1\times 1$, we obtain probability maps of $21\times 21\times 2$, where 2 is the number of channels. A probability map, $\mathcal{P}$, is generated for each input image, and bilinear interpolation is performed to resize the map to the input image size. Qualitative results in Fig. 4 demonstrate that our SS-Net can segment built and non-built areas with very high accuracy. Table 2 reports the accuracy and F1-score of SS-Net on the Village Finders test set.
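The output-size formula can be checked with a short worked example in plain Python. The sketch below applies the formula layer by layer with integer (floor) division. It assumes zero padding for the three pooling layers and for the $8\times 8$ convolution, since that is the setting under which the formula yields the quoted $21\times 21$ map, and it omits the size-preserving convolutions; the function names are hypothetical.

```python
def conv_output_size(n_in, kernel, stride=1, padding=0):
    """n_out = floor((n_in + 2P - K) / S) + 1 for one spatial dimension."""
    return (n_in + 2 * padding - kernel) // stride + 1

def ssnet_map_size(n_in):
    """Spatial size of the SS-Net output under the assumptions stated above."""
    size = n_in
    for _ in range(3):                        # three 2x2 max-pooling layers, stride 2
        size = conv_output_size(size, kernel=2, stride=2, padding=0)
    size = conv_output_size(size, kernel=8)   # 8x8 convolution, no padding (assumed)
    size = conv_output_size(size, kernel=1)   # 1x1 convolution
    return size
```

Under these assumptions, `ssnet_map_size(224)` returns 21 (224 to 112 to 56 to 28 through pooling, then 28 - 8 + 1 = 21), and `ssnet_map_size(64)` returns 1, which is consistent with a single prediction per $64\times 64$ patch.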

The building probability calculated on each pixel is used to improve the regression algorithm for counting buildings. In sections below, we discuss in detail our two proposed approaches that use output probability maps of SS-Net for improved building counts.

4.2.2 Global Weighted Average Pooling (GWAP):

As in Sec. 4.1, the pre-trained DenseNet is used to extract the features. However, in this algorithm the attention map generated by SS-Net is used to perform attention based Global Weighted Average Pooling (GWAP) over the features. To achieve GWAP, we first multiply each feature map extracted from DenseNet with the probability map generated by SS-Net and then compute the average of each channel independently. This results in dimensionality reduction while retaining spatial information. Each value in the pooled vector corresponds to the density of constructed regions in a satellite image. These activation maps and SS-Net output probability maps for a typical image are shown in Fig. 5 under the "Counting by Attention" pipeline. Similar to GAP (Lin et al., 2013), GWAP also corresponds directly to the learned features. However, in GWAP, features from different locations of the image are given different weights. As shown in Fig. 5, DenseNet produces feature maps (the output of the last convolutional layer) which are agnostic to built structures, while SS-Net provides a high probability score on the built area. Combining these two maps filters out the DenseNet activation values from non-built areas. This meaningful representation is then fed into the 3-layer fully connected neural network with 512, 32 and 1 units, referred to as the regression pipeline. The yellow block along with the blue block (Fig. 5) displays the network architecture of GWAP.
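The pooling step can be sketched in plain Python as follows. The sketch assumes that the SS-Net probability map has already been resampled to the spatial resolution of the DenseNet feature maps, a detail the text does not specify, and it represents tensors as nested lists for clarity; the function name `gwap` is hypothetical.

```python
def gwap(feature_maps, prob_map):
    """Attention-weighted global average pooling.

    feature_maps: list of C channels, each an H x W list of lists.
    prob_map:     H x W built-up probabilities at the same resolution (assumed).
    Returns a list of C pooled values, one per channel.
    """
    height, width = len(prob_map), len(prob_map[0])
    pooled = []
    for channel in feature_maps:
        total = 0.0
        for r in range(height):
            for c in range(width):
                # Re-weight each activation by the built-up probability at that location.
                total += channel[r][c] * prob_map[r][c]
        pooled.append(total / (height * width))
    return pooled
```

With a probability map of all ones, the function reduces to ordinary global average pooling, which makes the relationship between GAP and GWAP explicit.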

4.2.3 Cross Channel Parametric Pooling (CCPP):

GWAP allows the algorithm to consider only the built area; however, it suffers from a lack of accuracy. One reason is the effect of averaging, i.e., features representing buildings at different locations are summed up. However, recognizing them separately is required for an accurate building count, especially for densely built-up areas. Instead of predicting one single value for the whole image, if one could predict the count at different locations of the image, then the final number would be the sum of these counts, thus reducing the effect of averaging. However, we only have one count per image and not the count at each location. To counter this shortcoming, we design a network that accounts for both the cross-channel correlation and the spatial layout of the feature map. Specifically, we employ a convolution of kernel size $1\times 1$, which we call $C_{1}$, that outputs a single activation map. This activation map is presented to the fully connected regression pipeline, which predicts the final count. Note that the architecture of the regression pipeline is the same in all methods; however, minor changes to the activation function and optimizer were tried and are discussed in the implementation details.

The output of the layer $C_{1}$ is a single channel, visualized in the green block (Fig. 5). This convolutional layer $C_{1}$ performs learnable interactions within the weighted feature volume at every location. This layer learns the combination and comparison of all sizes of built-structures that are captured by the weighted feature map. Its response is different at different parts of the images, corresponding to the density of the buildings at that location.
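To make the role of $C_{1}$ concrete, the sketch below writes a single-output $1\times 1$ convolution as a per-location weighted sum over channels. The function name and the hand-picked kernel are illustrative assumptions; in the paper the kernel weights are learned during training.

```python
import numpy as np


def conv1x1_single_channel(weighted: np.ndarray, kernel: np.ndarray, bias: float = 0.0) -> np.ndarray:
    """1x1 convolution with one output channel.

    weighted: (H, W, C) attention-weighted feature volume.
    kernel:   (C,) one weight per input channel (learned in practice).
    Returns an (H, W) activation map; the spatial layout is preserved and
    only the channel dimension is collapsed.
    """
    return np.tensordot(weighted, kernel, axes=([2], [0])) + bias


volume = np.arange(12, dtype=float).reshape(2, 2, 3)
kernel = np.array([1.0, 0.0, -1.0])  # hypothetical weights for illustration
activation = conv1x1_single_channel(volume, kernel)
print(activation.shape)  # (2, 2)
print(activation)        # every entry is -2.0 for this toy input
```

Unlike GWAP, the output keeps one value per spatial location, so the regression pipeline that follows can still see where the responses occur.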

4.2.4 Counting by Attention with FusionNet:

All the models discussed above suffer from one shortcoming or another. GWAP is unable to give credence to local information. CCPP, while it handles local information, is challenged when images with a low count are presented, largely due to the lack of a larger perspective (Table 3). The attention based pipelines (Sec. 4.2.2 & 4.2.3) do better than the generic deep regression pipeline in our case, by detecting the areas with buildings. However, as shown in Fig. 4, our building segmentation system also removes other useful information, such as the location of streets or roads, or other markers highlighting the natural boundaries of the buildings. FusionNet has been designed to counter these shortcomings and enhance the benefits by fusing the features extracted from each method. These fused features are processed by the fully connected regression network, which outputs the final count. After concatenating the outputs of the FC layers, the number of units in the fused layer is 1536. Finally, the fused layer is fed into the 3-layer fully connected neural network with 512, 32 and 1 units, referred to earlier as the regression pipeline. The network architecture of FusionNet is displayed in Fig. 5.

When fused together, the above approaches complement each other, improving the learning of the regression pipeline. During training, the penalty is back-propagated collectively whenever all or any of the streams result in the prediction of an erroneous count.
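The fusion step can be sketched as a forward pass in NumPy. This is a shape-level illustration under stated assumptions, not the trained model: the weights are random, the helper names (`init_dense`, `fusion_forward`) are hypothetical, and leaky ReLU with slope 0.3 is assumed for the hidden layers with a linear output unit.

```python
import numpy as np


def leaky_relu(x: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    return np.where(x > 0, x, alpha * x)


def init_dense(rng: np.random.Generator, n_in: int, n_out: int):
    """Random weights and zero bias for one fully connected layer (illustrative)."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)


def fusion_forward(drc_feat, gwap_feat, ccpp_feat, params):
    """Concatenate three 512-d stream outputs and regress a single count."""
    fused = np.concatenate([drc_feat, gwap_feat, ccpp_feat], axis=-1)  # (N, 1536)
    h = fused
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:  # no activation on the final count unit (assumption)
            h = leaky_relu(h)
    return fused, h


rng = np.random.default_rng(0)
params = [init_dense(rng, 1536, 512), init_dense(rng, 512, 32), init_dense(rng, 32, 1)]
streams = [rng.normal(size=(4, 512)) for _ in range(3)]  # batch of 4 images
fused, count = fusion_forward(*streams, params)
print(fused.shape, count.shape)  # (4, 1536) (4, 1)
```

Because a single loss is computed on the final output, its gradient reaches all three streams through the concatenation, which matches the collective back-propagation described above.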

4.3 Implementation details

All regression-based models are trained on images of $336\times 336$ pixels, which correspond to an area of approximately $100\textit{m}\times 100\textit{m}$ on the ground at a resolution of 0.3 m per pixel. While using DenseNet features, we did not use any normalization technique because, in our experiments, normalizing remote sensing data with ImageNet mean values disrupts the images. To prevent the model from over-fitting, Dropout layers (Srivastava et al., 2014) are applied with a ratio of $0.6$ on fully connected layers. Apart from FusionNet, all models have the same regression pipeline comprising three fully connected (FC) layers of 512, 32 and 1 units. In FusionNet, the last fully connected layer of each of the three blocks (DRC, CCPP, GWAP) contains 512 units. Concatenating them creates an input layer of dimension 1536, which is fed to a fully connected layer of 512 units, followed by fully connected layers of 32 units and 1 unit. While training the Deep Regression Counting (DRC) model, we use ReLU as the activation function. However, for all attention based models, leaky ReLU with a ratio of $0.3$ is used. This counters the high activation values resulting from the DenseNet features and their product with the probability maps. For training the built-up area segmentation network, we normalize by subtracting the mean of the whole train set from each image. SS-Net is trained with a batch size of 16 on patches of size 64, so only the first three blocks of VGG-16 are used. A learning rate of $10^{-5}$ is used with the SGD optimizer to train the SS-Net. The training data for counting was augmented by flipping, inserting images, and rotating them by 90 degrees and 270 degrees, increasing its size five times. All the experiments were performed using Keras with TensorFlow as a backend. A channel-wise cross-entropy loss function is used for training SS-Net. For training DRC, mean squared error is used, whereas all attention based networks (GWAP, CCPP, and FusionNet) are trained with root mean squared error.
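For reference, the two regression objectives named above differ only by a square root. The sketch below uses the standard definitions in NumPy; the function names are illustrative and this is not the Keras code used in the experiments.

```python
import numpy as np


def mse(y_true, y_pred) -> float:
    """Mean squared error, the loss reported for DRC."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def rmse(y_true, y_pred) -> float:
    """Root mean squared error, the loss reported for GWAP, CCPP and FusionNet."""
    return float(np.sqrt(mse(y_true, y_pred)))


print(mse([10, 20], [13, 16]))   # 12.5
print(rmse([10, 20], [13, 16]))  # about 3.54
```

RMSE is expressed in the same units as the building count, which makes its magnitude easier to compare with the MAE values reported in the results.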

5 Results and Analysis

A thorough comparative analysis of proposed approaches is performed by evaluating their results on the test set of 531 satellite images, extracted from our collected dataset. In Table 3, we provide a quantitative comparison of all four proposed approaches, by calculating the mean absolute error (MAE), total absolute error (TAE) and R-squared measure. The MAE decreases and R-squared values improve, as we move from deep regression counting to FusionNet.

In order to perform an in-depth analysis of our results, the test set is divided into three ranges on the basis of the ground-truth count of the buildings: (a) Low-Count (0 to 30): less built, (b) Medium-Count (31 to 60): reasonably populated and (c) High-Count (greater than 60): densely built. Out of 531 images, 416, 100 and 15 are in the Low-Count, Medium-Count and High-Count ranges, respectively. The TAE for all four approaches on each set is computed separately. The 531 testing images cover a total of 8945 structures. MAE is computed by dividing the total absolute error by the total number of images. Compared to deep regression counting, both attention re-weighted counting models give better results. Finally, the fusion of all three approaches further decreases the mean absolute error (see Table 3). The proposed approach is quite efficient; DRC, GWAP and FusionNet took 0.07 (0.02), 0.8 (0.026), and 0.9 (0.029) sec/image (sec/Km) respectively.
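The evaluation protocol above can be sketched as follows. The function names and the toy counts are made up for illustration, and R-squared is computed here as the usual coefficient of determination, which is an assumption since the paper does not spell out its definition.

```python
import numpy as np


def counting_metrics(y_true, y_pred) -> dict:
    """TAE, MAE (TAE divided by number of images) and R-squared."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    tae = float(np.abs(y_true - y_pred).sum())
    mae = tae / len(y_true)
    ss_res = float(((y_true - y_pred) ** 2).sum())
    ss_tot = float(((y_true - y_true.mean()) ** 2).sum())
    return {"TAE": tae, "MAE": mae, "R2": 1.0 - ss_res / ss_tot}


def tae_by_range(y_true, y_pred) -> dict:
    """TAE split by ground-truth count: Low (0-30), Medium (31-60), High (>60)."""
    y_true = np.asarray(y_true, dtype=float)
    abs_err = np.abs(y_true - np.asarray(y_pred, dtype=float))
    masks = {
        "Low": y_true <= 30,
        "Medium": (y_true > 30) & (y_true <= 60),
        "High": y_true > 60,
    }
    return {name: float(abs_err[mask].sum()) for name, mask in masks.items()}


y_true = [10, 40, 70, 20]  # toy ground-truth counts, not from the dataset
y_pred = [12, 37, 75, 20]  # toy predictions
print(counting_metrics(y_true, y_pred))  # TAE 10.0, MAE 2.5, R2 about 0.982
print(tae_by_range(y_true, y_pred))      # {'Low': 2.0, 'Medium': 3.0, 'High': 5.0}
```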

5.1 Comparison and Analysis of results

As indicated by the MAE results in Table 3 and Fig. 6, introducing the attention mechanism considerably decreases the MAE (3.6%) and increases the R-squared value. Fig. 7 shows the images corresponding to the minimum and maximum MAE for all of our proposed models. On fine-grained analysis, we observe that the GWAP network is accurate for the Low-Count images, whereas the CCPP network achieves a lower TAE on the Medium-Count and High-Count images (Table 3). For the low-density images, where both CCPP and GWAP are much better than DRC, GWAP’s TAE is much lower than that of CCPP. With attention, the MAE between the ground truth and predicted count generally decreases, but CCPP appears to be distracted and over-counts the structures in some of the images. For example, in Fig. 7, vehicles parked on the road mislead the model. However, for both medium- and high-density images, the TAE of CCPP is much lower than that of GWAP, indicating that more detailed local information is needed for counting where the building density is higher. To capitalize on the complementary nature of CCPP and GWAP, FusionNet is trained, which combines the deep regression counting with both attention models. Fig. 6 shows the comparison of the MAE of these models.

We also retrained DRC on data with the ImageNet mean subtracted; the MAE of this mean-subtracted DRC rose from 4.14 to 16.98. In addition, deep regression was performed on the features extracted from SS-Net. This resulted in an increase in MAE to 5.5, since these features do not capture the inner structure in the segment.

5.2 Counting in large neighbourhood

In order to show the generalization capacity and effectiveness of our model, we test our approach on a portion of Cairo’s densely populated region. The satellite image is of size $1008\times 3024$ pixels, covering an area of $302.4\times 907.2m^{2}$. The region covered in this testing tile is diverse, containing both small and large structures in densely and moderately populated areas. Ground truth was created by manually counting the buildings, giving a total of 656. Our approach, FusionNet, predicted 675 buildings, which is quite close to the ground truth. In order to perform a detailed analysis, we divide the image into 27 cells, where each cell is of size $336\times 336$ pixels. FusionNet’s prediction for each cell is compared with the ground-truth count of the buildings in that cell; the ground truth is obtained by hand-marking each cell in the image. The predicted count is overlaid on the map for visualization (Fig. 1) by assigning different colors according to the predicted count in each window. For a quantitative comparison, a graph of predicted count and ground-truth count is shown in Fig. 1, indicating that the predicted values closely follow the ground truth. We argue that cell 18 is intensely populated and contains irregular construction, which makes it difficult even for human annotators to count. High accuracy on a large image outside of the training and test sets demonstrates the generalization capacity and robustness of our proposed approach.
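The cell-wise evaluation can be sketched as a simple tiling routine. The code below is an illustration under stated assumptions: `predict_count` is a hypothetical stand-in for a trained counting model, and the tile-level count is assumed to be the sum of the per-cell predictions, which the paper does not state explicitly.

```python
import numpy as np


def split_into_cells(image: np.ndarray, cell: int = 336) -> list:
    """Split an (H, W, ...) image into non-overlapping cell x cell patches, row by row."""
    h, w = image.shape[:2]
    if h % cell or w % cell:
        raise ValueError("image size must be a multiple of the cell size")
    return [
        image[r:r + cell, c:c + cell]
        for r in range(0, h, cell)
        for c in range(0, w, cell)
    ]


def count_large_tile(image: np.ndarray, predict_count, cell: int = 336):
    """Return per-cell predictions and their sum for a large tile."""
    per_cell = [float(predict_count(patch)) for patch in split_into_cells(image, cell)]
    return per_cell, sum(per_cell)


tile = np.zeros((1008, 3024, 3), dtype=np.uint8)  # same size as the Cairo tile
cells = split_into_cells(tile)
print(len(cells), cells[0].shape)  # 27 (336, 336, 3)

# Dummy predictor returning one building per cell, for demonstration only.
per_cell, total = count_large_tile(tile, predict_count=lambda patch: 1.0)
print(total)  # 27.0
```

A $1008\times 3024$ tile yields a $3\times 9$ grid, which gives the 27 cells used in the analysis above.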

6 Conclusion

In this paper, we have attempted to solve the difficult problem of counting buildings from satellite imagery. The diversity in the shape of urban structures and the variations in city planning and sensor response make the problem challenging. We have introduced a new challenging benchmark dataset capturing different geographical regions and areas with different building counts at various building densities (the dataset will be made publicly available). Instead of using deep learning as a black box, we have presented an attention based mechanism, based on insights into how Deep Convolutional Neural Networks work, so that our model can capture variations in urban structures. Our final solution, FusionNet, combines the information captured by different pipelines at different granularities, making it robust to densely built as well as sparsely built areas, and to large structures (covering a large area) as well as small structures. FusionNet is able to handle a variety of roof types, including the difficult case of flat roofs, especially when the buildings are interconnected. Future directions include improving the image quality through super-resolution before feature computation and investigating other pooling techniques to improve building counting.

Acknowledgment: We greatly appreciate discussion and useful comments provided by Hamza Rawal, Maria Zubair, Komal Khan and Umar Saif.

Bibliography

1Albert et al., (2017) Albert, A., Kaur, J. and Gonzalez, M. C., 2017. Using convolutional networks and satellite imagery to identify patterns in urban environments at a large scale. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 1357–1366.
2Attia and Dayan, (2018) Attia, A. and Dayan, S., 2018. Detecting and counting tiny faces. arXiv preprint arXiv:1707.08952.
3Audebert et al., (2016) Audebert, N., Le Saux, B. and Lefèvre, S., 2016. Semantic segmentation of earth observation data using multimodal and multi-scale deep networks. In: Asian Conference on Computer Vision, Springer, pp. 180–196.
4Badrinarayanan et al., (2017) Badrinarayanan, V., Kendall, A. and Cipolla, R., 2017. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 2481–2495.
5Bengio, (2012) Bengio, Y., 2012. Deep learning of representations for unsupervised and transfer learning. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pp. 17–36.
6Boyd et al., (2018) Boyd, D. S., Jackson, B., Wardlaw, J., Foody, G. M., Marsh, S. and Bales, K., 2018. Slavery from space: Demonstrating the role for satellite remote sensing to inform evidence-based action related to UN SDG number 8. ISPRS Journal of Photogrammetry and Remote Sensing 142, pp. 380–388.
7Chen et al., (2014) Chen, D., Shang, S. and Wu, C., 2014. Shadow-based building detection and segmentation in high-resolution remote sensing image. Journal of Multimedia 9, pp. 181–188.
8Cheng et al., (2017) Cheng, G., Han, J. and Lu, X., 2017. Remote sensing image scene classification: benchmark and state of the art. Proceedings of the IEEE 105(10), pp. 1865–1883.