End-to-end Cloud Segmentation in High-Resolution Multispectral Satellite Imagery Using Deep Learning
Giorgio Morales, Alejandro Ramírez, Joel Telles

TL;DR
This paper introduces a new high-resolution satellite cloud dataset and an end-to-end deep learning segmentation method that significantly improves accuracy in cloud detection tasks.
Contribution
It provides the CloudPeru2 dataset and a CNN-based segmentation approach using Deeplab v3+ for automated cloud detection in satellite imagery.
Findings
Achieved 96.62% accuracy in cloud segmentation
Outperformed existing methods in precision and specificity
Provided a valuable dataset for future research
Abstract
Segmenting clouds in high-resolution satellite images is an arduous and challenging task due to the many types of geographies and clouds a satellite can capture. Therefore, it needs to be automated and optimized, especially for those who regularly process great amounts of satellite images, such as governmental institutions. In that sense, the contribution of this work is twofold: we present the CloudPeru2 dataset, consisting of 22,400 images of 512x512 pixels and their respective hand-drawn cloud masks, and we propose an end-to-end segmentation method for clouds using a Convolutional Neural Network (CNN) based on the Deeplab v3+ architecture. The results over the test set achieved an accuracy of 96.62%, precision of 96.46%, specificity of 98.53%, and sensitivity of 96.72%, which is superior to the compared methods.
Taxonomy
Methods: Conditional Random Field · Dilated Convolution · Dense Connections · Feedforward Network · DeepLab
End-to-end Cloud Segmentation in High-Resolution Multispectral Satellite Imagery Using Deep Learning
Giorgio Morales, Alejandro Ramírez, Joel Telles
National Institute of Research and Training in Telecommunications (INICTEL-UNI)
National University of Engineering, Lima, Peru
Abstract
Segmenting clouds in high-resolution satellite images is an arduous and challenging task due to the many types of geographies and clouds a satellite can capture. Therefore, it needs to be automated and optimized, especially for those who regularly process great amounts of satellite images, such as governmental institutions. In that sense, the contribution of this work is twofold: we present the CloudPeru2 dataset, consisting of 22,400 images of 512x512 pixels and their respective hand-drawn cloud masks, and we propose an end-to-end segmentation method for clouds using a Convolutional Neural Network (CNN) based on the Deeplab v3+ architecture. The results over the test set achieved an accuracy of 96.62%, precision of 96.46%, specificity of 98.53%, and sensitivity of 96.72%, which is superior to the compared methods.
This paper is a preprint (submitted to the INTERCON 2019 conference, Lima, Peru). IEEE copyright notice. © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Index Terms:
Cloud segmentation, end-to-end learning, satellite image.
I Introduction
Since the very beginning of remote sensing, clouds have represented the most overwhelming type of noise in optical satellite imagery because they block everything beneath them. In addition, the high variance in their spectral response can add statistical noise to a database if some cloud pixels get into it. For those reasons, filtering clouds through a detection process is one of the most traditional problems in remote sensing.
In the literature, this problem has been addressed from many perspectives: from empirical thresholded decision trees [1, 2], fuzzy logic [3], time series (when data are available) [4], and machine learning [5, 6, 7] to a more recent approach, deep learning [8, 9, 10]. Even though some of the previous works achieved outstanding results, due to the high risk that clouds represent, generating more accurate models for cloud detection is still valuable to enhance the results of subsequent remote sensing methods and algorithms.
Because some institutions, such as the National Commission for Aerospace Research and Development (CONIDA) of Peru, process a great number of satellite images daily, it is necessary to develop a method to automatically and rapidly obtain their corresponding cloud masks. For this, we propose an efficient cloud segmentation method for high-resolution multispectral satellite images using a trainable end-to-end convolutional neural network (CNN). In order to train our network and compare its performance with other methods, we present a large dataset consisting of 22,400 image patches extracted from PERUSAT-1, a Peruvian satellite managed and supervised by CONIDA.
II Proposed Method
II-A CloudPeru2 Dataset
A PERUSAT-1 scene has four spectral bands: red (0.63-0.7 µm), green (0.53-0.59 µm), blue (0.45-0.50 µm) and NIR (0.752-0.885 µm). The spatial resolution of the multispectral bands is 2.8 m per pixel and that of the panchromatic band is 0.7 m per pixel.
We used 153 PERUSAT-1 scenes of variable sizes and from different geographies to extract 2800 image patches of 512x512 pixels and create the CloudPeru2 dataset [11]. The scenes were previously orthorectified and adjusted to reflectance values with atmospheric correction. Each image patch has a corresponding hand-drawn cloud mask. Image samples from the dataset are shown in Fig. 1. For this work, we used data augmentation to increase the dataset size in order to avoid overfitting problems. In that sense, we rotated each patch by 90°, 180° and 270°, and flipped each one horizontally, so that we get a total of 22,400 patches (eight variants per original patch). We assigned 90% of the data to the training set, 5% to the validation set and 5% to the test set.
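As an illustration of the augmentation and split described above, the following NumPy sketch generates the eight variants of a patch (four rotations, each with and without a horizontal flip) and splits the resulting indices 90/5/5. The function names `augment_patch` and `split_indices`, the random seed, and the choice to split after augmenting are assumptions made for the example; the paper does not state these details. The same transforms would have to be applied to each cloud mask.

```python
import numpy as np

def augment_patch(patch):
    """Return 8 variants of an H x W x C patch: 4 rotations x (original, flipped)."""
    variants = []
    for k in range(4):  # k quarter-turns: 0, 90, 180 and 270 degrees
        rotated = np.rot90(patch, k=k, axes=(0, 1))
        variants.append(rotated)
        variants.append(rotated[:, ::-1])  # horizontal flip: reverse the column axis
    return variants

def split_indices(n, train=0.90, val=0.05, seed=0):
    """Shuffle n sample indices and split them into train / validation / test."""
    rng = np.random.default_rng(seed)  # seed is an arbitrary choice for reproducibility
    idx = rng.permutation(n)
    n_train = int(round(n * train))
    n_val = int(round(n * val))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Toy stand-in for one 4-band patch (a real patch would be 512 x 512 x 4).
patch = np.arange(4 * 4 * 4, dtype=np.float32).reshape(4, 4, 4)
variants = augment_patch(patch)

# 2800 original patches x 8 variants = 22,400 samples.
train_idx, val_idx, test_idx = split_indices(2800 * len(variants))
```

With 22,400 samples this yields 20,160 training, 1,120 validation and 1,120 test indices.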
In a previous work [10], we presented the CloudPeru dataset, which was used to classify small image patches as clouds or non-clouds. This dataset consists of 476,422 image patches of pixels extracted from only 15 different PERUSAT-1 scenes. In contrast, the CloudPeru2 dataset presents a greater number of scenarios (e.g. snow and ocean) and a bigger patch size; besides, it was specifically created to solve a segmentation problem.
In order to appreciate and verify the diversity of scenarios of the CloudPeru2 dataset, we used t-SNE to arrange the images by categories in a 2-D space, as shown in Fig. 2. For this, we used a small CNN with the same architecture as the network presented in [10], and we trained it on the SAT-6 airborne dataset [12]. Then, we resized the images of our dataset to pixels and used the trained network to extract a vector of 128 features (i.e. the output of the first fully connected layer) from them. Finally, the extracted feature vectors were mapped to the 2-D space with t-SNE.
II-B Proposed CNN for Segmentation
We propose a semantic level segmentation of clouds in satellite imagery using a Convolutional Neural Network (CNN). The architecture of our network is the same as that used in [13] with the only difference that instead of three channels, our network uses inputs of four channels (R, G, B, and NIR). This CNN is based on the Deeplab v3+ architecture [14], which integrates an encoder, an atrous spatial pyramid pooling module (ASPP), and a decoder.
In Fig. 3 we show the proposed network architecture. Convolution blocks are denoted as “CONV”; inverted residual units, as “IRU”; and atrous separable convolution blocks, as “ASC”. The inverted residual unit (IRU) [15] expands the input number of channels using a convolution, then applies a depthwise convolution (the number of channels remains the same), and, finally, applies another convolution that reduces the number of channels. The atrous separable convolution (ASC) is a depthwise convolution with atrous convolutions followed by a pointwise convolution. The output number of filters of each block is reported using the hash symbol (“#”). The stride of all convolutions is denoted as “s”. Blocks marked with “S” are “same padded”, which means that the output is the same size as the input. “ReLU” represents a standard rectified linear unit activation layer and “BN” a batch normalization layer.
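To make the ASC block more concrete, the sketch below implements a "same padded" atrous separable convolution in plain NumPy: a depthwise convolution whose kernel taps are spaced by a dilation rate, followed by a pointwise (1x1) convolution that mixes channels. The function name, the 3x3 kernel size and the dilation rate of 2 are assumptions chosen for the example; the paper's actual layer settings are given in Fig. 3 and are not reproduced here. Bias, BN and ReLU are omitted.

```python
import numpy as np

def atrous_separable_conv(x, depthwise_kernels, pointwise_weights, rate=2):
    """Depthwise atrous convolution followed by a pointwise convolution.

    x:                 (H, W, C_in) input tensor
    depthwise_kernels: (k, k, C_in) one k x k filter per input channel (k odd)
    pointwise_weights: (C_in, C_out) weights of the 1x1 convolution
    """
    h, w, _ = x.shape
    k = depthwise_kernels.shape[0]
    pad = rate * (k - 1) // 2  # zero padding that keeps the output size equal to the input
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    depthwise = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            # Each kernel tap reads the input shifted by a multiple of the dilation rate
            # (cross-correlation, as in deep learning frameworks).
            window = xp[i * rate:i * rate + h, j * rate:j * rate + w, :]
            depthwise += window * depthwise_kernels[i, j, :]
    # The pointwise step is a matrix product over the channel axis at every pixel.
    return depthwise @ pointwise_weights

x = np.ones((8, 8, 4))
out = atrous_separable_conv(x, np.ones((3, 3, 4)), np.ones((4, 6)), rate=2)
```

For these shapes the separable form uses 3·3·4 + 4·6 = 60 weights, whereas a standard 3x3 convolution from 4 to 6 channels would use 3·3·4·6 = 216.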
III Results
III-A Training Results and Metrics Comparison
The proposed algorithm was implemented using Python 3.6 on a server with an Intel Xeon E5-2620 CPU at 2.1 GHz, 128 GB of RAM and two NVIDIA GeForce GTX 1080 GPUs. The proposed CNN was trained using an Adam optimizer with a learning rate of 0.003, momentum terms β1 = 0.9 and β2 = 0.999, and a mini-batch size of 8. Figure 4 shows the evolution of network accuracy and loss over 300 epochs.
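For reference, the two momentum terms quoted above correspond to the decay rates of the first and second moment estimates in the standard Adam update. The sketch below writes out one such update with the stated learning rate and decay rates; the function name `adam_step` and the value of the stabilizing constant `eps` are assumptions, since the paper does not mention them.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.003, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter, eps is an assumed default."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
param, m, v = adam_step(param, np.array([0.5]), m, v, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by approximately the learning rate.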
In addition, we compared our method with four other cloud detection methods. The first method [16] proposes a new deep residual architecture called CloudNet to semantically segment clouds. Its main unit uses a convolutional block of four channels followed by an ASPP module with seven dilation rates, whose results are concatenated along with the output of the first convolution. By doing so, it preserves the spatial information, since it does not use any pooling or strided operation. We implemented a network with 12 of these units, according to the best results reported by the authors. The second method [10] subdivides the image into superpixels, generates patches from each superpixel, and classifies each patch as cloud or non-cloud using a small CNN. The third method [7] calculates a set of texture and spectral descriptors and processes them using a fully-connected neural network. Finally, the fourth method [17] uses a progressive refinement scheme.
We quantitatively compare all methods with respect to the ground truth using four metrics in the validation set: accuracy (ACC), precision (PREC), recall/sensitivity (SN), and specificity (SP), as shown in Table I. The ACC ratio indicates the correctly predicted observations against total observations; the PREC ratio indicates the correctly predicted positive observations against the total predicted positive observations; the SN ratio indicates the correctly predicted positive observations against the total actual positive observations; and the SP ratio indicates the correctly predicted negative observations against the total actual negative observations.
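The four ratios defined above can be computed from the pixel-wise counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), with cloud as the positive class. The following is a minimal sketch; the function name `segmentation_metrics` is hypothetical, and it assumes that both classes occur in the prediction and in the ground truth, so that no denominator is zero.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise ACC, PREC, SN and SP for binary masks (True/1 = cloud)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # cloud pixels predicted as cloud
    tn = np.sum(~pred & ~truth)  # clear pixels predicted as clear
    fp = np.sum(pred & ~truth)   # clear pixels predicted as cloud
    fn = np.sum(~pred & truth)   # cloud pixels predicted as clear
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "PREC": tp / (tp + fp),
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
    }

truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(pred, truth)
```

In this toy case TP = 2, FN = 1, FP = 1 and TN = 4, which gives ACC = 0.75, PREC = SN = 2/3 and SP = 0.8.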
From Table I we observe that the greatest accuracy and sensitivity values correspond to our method (97.5% and 98.46%), a difference of more than three and eight percentage points, respectively, over CloudNet. The visual comparison of the cloud masks generated by all mentioned methods is shown in Fig. 5; these masks were generated from six different images with both low and high density clouds. Our method produces the masks most similar to the ground truth, especially when it comes to discerning between snow and clouds (fifth column of Fig. 5), while other methods prioritize the segmentation of the most obvious high-density clouds. It is also worth mentioning that the most frequent type of error produced by our network is due to false positives, as evidenced by the fact that the lowest metrics of our network are the precision and specificity values; these errors are caused by small differences between the delineation of the borders of the ground truth and the generated masks. In the end, the results over the test set achieved an accuracy of 96.62%, precision of 96.46%, specificity of 98.53%, and sensitivity of 96.72%.
We would also like to state that although our version of CloudNet has only 6,077 trainable parameters and our method has 503,377, the amount of computation and memory required by our approach is lower than that of CloudNet. For instance, when training CloudNet, we had to reduce the number of training samples to just 15,960, use a mini-batch size of 12, and use randomly cropped images in order to reduce the training time and memory consumption. This is explained by the fact that CloudNet does not reduce the size of its tensors at any moment, which consumes a lot of computational resources, whereas smaller tensors with a larger number of channels consume far less memory. As a result, a single training epoch of our network (1330 batches, mini-batch size of 16, and inputs of pixels) lasted 20 minutes, while a single training epoch of CloudNet (1330 batches, mini-batch size of 12, and inputs of pixels) lasted 22 minutes.
III-B Cloud Segmentation on Satellite Scenes
We have trained a CNN to segment clouds on small patches of 512x512 pixels; however, the width and height of a PERUSAT-1 satellite scene are normally greater than 6000 pixels. Therefore, we move a 512x512-pixel sliding window across the scene in both the horizontal and vertical directions with a 50-pixel overlap. In each position, we get a cloud probability mask using the trained network. In the overlapped areas, we keep the maximum probability value in order to avoid discontinuities in the final mask. Finally, we apply a threshold of 0.5 over the entire mask. Figure 6 shows the final cloud segmentation mask of a complete satellite scene.
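The scene-level procedure above can be sketched as follows. This is an illustrative version only: `segment_scene`, `window_starts` and `predict_fn` are hypothetical names, `predict_fn` stands in for the trained network (it must return one probability per pixel of the tile), the window is assumed to match the training patch size, and aligning the last window with the scene border is a design choice of this sketch rather than something stated in the paper. The scene is assumed to be at least as large as the window.

```python
import numpy as np

def window_starts(length, window, overlap):
    """Start offsets of overlapping windows; the last window ends at the border."""
    if length <= window:
        return [0]
    stride = window - overlap
    starts = list(range(0, length - window, stride))
    starts.append(length - window)
    return starts

def segment_scene(scene, predict_fn, window=512, overlap=50, threshold=0.5):
    """Merge per-window cloud probabilities with a maximum, then binarize."""
    h, w = scene.shape[:2]
    prob = np.zeros((h, w), dtype=np.float32)
    for top in window_starts(h, window, overlap):
        for left in window_starts(w, window, overlap):
            tile = scene[top:top + window, left:left + window]
            tile_prob = predict_fn(tile)
            region = prob[top:top + window, left:left + window]
            # Keep the highest probability wherever windows overlap.
            region[...] = np.maximum(region, tile_prob)
    # Treating a probability of exactly 0.5 as cloud is an assumption of this sketch.
    return prob >= threshold

# Toy usage: a stand-in "network" that reads the probability from channel 0.
rng = np.random.default_rng(0)
scene = rng.random((100, 120, 4), dtype=np.float32)
mask = segment_scene(scene, lambda tile: tile[..., 0], window=32, overlap=8)
```

Taking the maximum in overlapped areas means that a pixel is labeled as cloud if any window covering it predicts cloud, which favors sensitivity at window borders.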
IV Conclusions
In this paper, we propose an efficient method for the semantic segmentation of clouds in high-resolution multispectral satellite images. For this, we trained an end-to-end convolutional neural network based on a simplification of Google's Deeplab v3+ network. Comparing the results produced by our network with those produced by other cloud segmentation methods on the novel large dataset that we have proposed, we conclude that our method achieved the best performance metrics. This method was embedded into a user-friendly interface used by the National Commission for Aerospace Research and Development (CONIDA) of Peru, allowing them to process hundreds of satellite images automatically and rapidly.
Acknowledgment
The authors would like to thank the National Commission for Aerospace Research and Development (CONIDA) and the National Institute of Research and Training in Telecommunications of the National University of Engineering (INICTEL-UNI) for the support provided.
References
[1] Z. Zhu and C. Woodcock, “Object-based cloud and cloud shadow detection in Landsat imagery,” Remote Sens. Environ., vol. 118, pp. 83-94, 2012.
[2] A. Fisher, “Cloud and Cloud-Shadow Detection in SPOT 5 HRG Imagery with Automated Morphological Feature Extraction,” Remote Sens., vol. 6, no. 1, pp. 776-800, 2014.
[3] N. Singh and A.A. Maxton, “Detection of Clouds and Cloud Shadow in Satellite Images using Fuzzy Logic,” Int. J. Advanced Res. Comp. Eng. Technol. (IJARCET), vol. 3, no. 4, pp. 1225-1228, 2014.
[4] N. Champion, “Automatic Detection of Clouds and Shadows Using High Resolution Satellite Image Time Series,” in Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. (ISPRS), Prague, 2016, pp. 475-479.
[5] Y. Yuan and X. Hu, “Bag-of-words and object-based classification for cloud extraction from satellite imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 8, pp. 4197-4205, 2015.
[6] T. Bai, L. Deren, K. Sun, Y. Chen and L. Wenzhuo, “Cloud Detection for High-Resolution Satellite Imagery Using Machine Learning and Multi-Feature Fusion,” Remote Sensing, vol. 8, no. 9, p. 715, 2016.
[7] G. Morales, S.G. Huamán and J. Telles, “Cloud Detection for PERUSAT-1 Imagery Using Spectral and Texture Descriptors, ANN, and Panchromatic Fusion,” in Proceedings of the 3rd Brazilian Technology Symposium - Emerging Trends and Challenges in Technology (BT Sym 17), Brazil, 2017, pp. 1-7.
[8] M. Shi, F. Xie, Y. Zi and J. Yin, “Cloud detection of remote sensing images by deep learning,” in Proc. IEEE Int. Geosc. and Remote Sens. Symp. (IGARSS), Beijing, 2016, pp. 701-704.
