Nested Network with Two-Stream Pyramid for Salient Object Detection in Optical Remote Sensing Images

Chongyi Li; Runmin Cong; Junhui Hou; Sanyi Zhang; Yue Qian; Sam Kwong

arXiv:1906.08462·cs.CV·January 8, 2020

TL;DR

This paper introduces LV-Net, a novel deep learning architecture with a two-stream pyramid and nested encoder-decoder modules, designed to improve salient object detection in optical remote sensing images with diverse scales and cluttered backgrounds.

Contribution

The paper presents LV-Net and constructs ORSSD, the first publicly available dataset for salient object detection in optical RSIs, and reports that LV-Net outperforms fourteen existing methods on this dataset.

Findings

01

LV-Net outperforms state-of-the-art methods quantitatively.

02

Constructed the first optical RSI salient object detection dataset.

03

Effective multi-scale and detail perception in complex remote sensing images.

Abstract

Arising from the various object types and scales, diverse imaging orientations, and cluttered backgrounds in optical remote sensing image (RSI), it is difficult to directly extend the success of salient object detection for nature scene image to the optical RSI. In this paper, we propose an end-to-end deep network called LV-Net based on the shape of network architecture, which detects salient objects from optical RSIs in a purely data-driven fashion. The proposed LV-Net consists of two key modules, i.e., a two-stream pyramid module (L-shaped module) and an encoder-decoder module with nested connections (V-shaped module). Specifically, the L-shaped module extracts a set of complementary information hierarchically by using a two-stream pyramid structure, which is beneficial to perceiving the diverse scales and local details of salient objects. The V-shaped module gradually integrates…

Tables

Table 1. TABLE I: Input Size and Output Size (batch, width, height, channel) of Each Convolutional Unit in the Proposed LV-Net.

	Input Size	Output Size
M-CU_1	(16,64,64,3)	(16,64,64,64)
M-CU_2	(16,32,32,3)	(16,32,32,128)
M-CU_3	(16,16,16,3)	(16,16,16,256)
M-CU_4	(16,8,8,3)	(16,8,8,512)
CU_(0,0)	(16,128,128,3)	(16,128,128,64)
CU_(1,0)	(16,64,64,67)	(16,64,64,128)
CU_(2,0)	(16,32,32,131)	(16,32,32,256)
CU_(3,0)	(16,16,16,259)	(16,16,16,512)
CU_(4,0)	(16,8,8,515)	(16,8,8,1024)
CU_(0,1)	(16,128,128,192)	(16,128,128,64)
CU_(1,1)	(16,64,64,384)	(16,64,64,128)
CU_(2,1)	(16,32,32,768)	(16,32,32,256)
CU_(3,1)	(16,16,16,1536)	(16,16,16,512)
CU_(0,2)	(16,128,128,192)	(16,128,128,64)
CU_(1,2)	(16,64,64,384)	(16,64,64,128)
CU_(2,2)	(16,32,32,768)	(16,32,32,256)
CU_(0,3)	(16,128,128,192)	(16,128,128,64)
CU_(1,3)	(16,64,64,768)	(16,64,64,128)
CU_(0,4)	(16,128,128,320)	(16,128,128,1)

Table 2. TABLE II: Quantitative Comparisons with Different Methods on the Testing Subset of ORSSD Dataset.

Method	Precision	Recall	$F_{β}$	MAE	$S_{m}$
DSR [20]	$0.6829$	$0.5972$	$0.6610$	$0.0859$	$0.7082$
RBD [18]	$0.7080$	$0.6268$	$0.6874$	$0.0626$	$0.7662$
RRWR [48]	$0.5782$	$0.6591$	$0.5950$	$0.1324$	$0.6835$
HDCT [49]	$0.6071$	$0.4969$	$0.5775$	$0.1309$	$0.6197$
DSG [50]	$0.6843$	$0.6007$	$0.6630$	$0.1041$	$0.7195$
MILPS [51]	$0.6954$	$0.6549$	$0.6856$	$0.0913$	$0.7361$
RCRR [15]	$0.5782$	$0.6552$	$0.5944$	$0.1277$	$0.6849$
SSD [29]	$0.5188$	$0.4066$	$0.4878$	$0.1126$	$0.5838$
SPS [31]	$0.4539$	$0.4154$	$0.4444$	$0.1232$	$0.5758$
ASD [33]	$0.5582$	$0.4049$	$0.5133$	$0.2119$	$0.5477$
DSS [24]	$0.8125$	$0.7014$	$0.7838$	$0.0363$	$0.8262$
RADF [25]	$0.8311$	$0.6724$	$0.7881$	$0.0382$	$0.8259$
R3Net [16]	$0.8386$	$0.6932$	$0.7998$	$0.0399$	$0.8141$
RFCN [28]	$0.8239$	$0.7376$	$0.8023$	$0.0293$	$0.8437$
LV-Net	0.8672	0.7653	0.8414	0.0207	0.8815

Table 3. TABLE III: Comparisons of the Average Running Time (seconds per image) on the Testing Subset of ORSSD Dataset and Model Size (MB).

Methods	DSR	RBD	RRWR	HDCT	DSG	MILPS	RCRR	SSD
Time	$14.22$	$0.62$	$2.91$	$7.13$	$1.57$	$26.34$	$3.14$	$-$
Model size	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$
Methods	SPS	ASD	DSS	RADF	R3Net	RFCN	LV-Net
Time	$-$	$-$	$0.12$	$0.15$	$0.48$	$1.10$	$0.74$
Model size	$-$	$-$	$248$	$248$	$142$	$744$	$207$

Table 4. TABLE IV: Quantitative Evaluation of Ablation Studies on the Testing Subset of ORSSD Dataset.

	$F_{β}$	MAE	$S_{m}$
LV-Net w/o Input-Pyramid	$0.8297$	$0.0231$	$0.8713$
LV-Net w/o Feature-Pyramid	$0.8335$	$0.0227$	$0.8784$
LV-Net w/o L	$0.8248$	$0.0240$	$0.8684$
LV-Net w/o Nest	$0.7672$	$0.0280$	$0.8299$
LV-Net w/o Nest+	$0.7821$	$0.0334$	$0.8288$
V-Net	$0.8090$	$0.0297$	$0.8446$
V-Net-D	$0.8089$	$0.0302$	$0.8429$
LV-Net	0.8414	0.0207	0.8815

Table 5. TABLE V: Quantitative Evaluation of the Effects of Network Parameter Settings on the Testing Subset of ORSSD Dataset.

	$F_{β}$	MAE	$S_{m}$
LV-Net-8-16	$0.8080$	$0.0251$	$0.8532$
LV-Net-16-32	$0.8327$	$0.0213$	$0.8782$
LV-Net-3Scales	$0.7963$	$0.0348$	$0.8168$
LV-Net-4Scales	$0.8385$	$0.0249$	$0.8634$
LV-Net-S-CU	$0.8379$	$0.0212$	$0.8752$
LV-Net	0.8414	0.0207	0.8815

Equations

$$I_{ds}^{k} = \mathrm{maxpool}(I_{in}),$$

$$F_{7\times 7}^{k} = \sigma(\mathbf{W}_{7\times 7}^{k} * I_{ds}^{k} + \mathbf{b}_{7\times 7}^{k}),$$

$$F_{5\times 5}^{k} = \sigma(\mathbf{W}_{5\times 5}^{k} * F_{7\times 7}^{k} + \mathbf{b}_{5\times 5}^{k}),$$

$$F_{3\times 3}^{k} = \sigma(\mathbf{W}_{3\times 3}^{k} * F_{5\times 5}^{k} + \mathbf{b}_{3\times 3}^{k}),$$

$$F_{(0,0)} = CU_{(0,0)}(I_{in}),$$

$$F_{(p,0)} = CU_{(p,0)}(\{F_{3\times 3}^{p}, I_{ds}^{p}, F_{(p-1,0)}^{down}\}),$$

$$F_{(q,1)} = CU_{(q,1)}(\{F_{(q,0)}, F_{(q+1,0)}^{up}\}),$$

$$F_{(m,2)} = CU_{(m,2)}(\{F_{(m,1)}, F_{(m+1,1)}^{up}\}),$$

$$F_{(0,3)} = CU_{(0,3)}(\{F_{(0,2)}, F_{(1,2)}^{up}\}),$$

$$F_{(1,3)} = CU_{(1,3)}(\{F_{(1,2)}, F_{(2,2)}^{up}, F_{(1,1)}\}),$$

$$F_{(0,4)} = CU_{(0,4)}(\{F_{(0,3)}, F_{(1,3)}^{up}, F_{(0,1)}, F_{(0,2)}\}),$$

$$L = -(y\log(z) + (1-y)\log(1-z)),$$

$$L = -(y\log(F_{clip}(z)) + (1-y)\log(1-F_{clip}(z))).$$

Full text

Nested Network with Two-Stream Pyramid for Salient Object Detection in Optical Remote Sensing Images

Chongyi Li, Runmin Cong, Junhui Hou, Sanyi Zhang, Yue Qian, and Sam Kwong

Manuscript received Nov. 2018. This work was supported in part by the National Natural Science Foundation of China under Grant 61871342, in part by Hong Kong RGC General Research Funds under Grant 9042038 (CityU 11205314) and Grant 9042322 (CityU 11200116), and in part by Hong Kong RGC Early Career Schemes under Grant 9048123 (CityU 21211518). (Chongyi Li and Runmin Cong contributed equally to this work. Corresponding author: Runmin Cong.)

C. Li and Y. Qian are with the Department of Computer Science, City University of Hong Kong, Kowloon 999077, Hong Kong (e-mail: [email protected], [email protected]).

R. Cong is with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China, and also with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]).

J. Hou and S. Kwong are with the Department of Computer Science, City University of Hong Kong, Kowloon 999077, Hong Kong, and also with the City University of Hong Kong Shenzhen Research Institute, Shenzhen 51800, China (e-mail: [email protected], [email protected]).

S. Zhang is with the School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China (e-mail: [email protected]).

Abstract

Arising from the various object types and scales, diverse imaging orientations, and cluttered backgrounds in optical remote sensing image (RSI), it is difficult to directly extend the success of salient object detection for nature scene image to the optical RSI. In this paper, we propose an end-to-end deep network called LV-Net based on the shape of network architecture, which detects salient objects from optical RSIs in a purely data-driven fashion. The proposed LV-Net consists of two key modules, i.e., a two-stream pyramid module (L-shaped module) and an encoder-decoder module with nested connections (V-shaped module). Specifically, the L-shaped module extracts a set of complementary information hierarchically by using a two-stream pyramid structure, which is beneficial to perceiving the diverse scales and local details of salient objects. The V-shaped module gradually integrates encoder detail features with decoder semantic features through nested connections, which aims at suppressing the cluttered backgrounds and highlighting the salient objects. In addition, we construct the first publicly available optical RSI dataset for salient object detection, including 800 images with varying spatial resolutions, diverse saliency types, and pixel-wise ground truth. Experiments on this benchmark dataset demonstrate that the proposed method outperforms the state-of-the-art salient object detection methods both qualitatively and quantitatively.

Index Terms:

Salient object detection, optical remote sensing images, two-stream pyramid module, nested connections.

I Introduction

Salient object detection aims at locating the most attractive and visually distinctive objects or regions in an image, and has been used in image/video segmentation [1], image retargeting [2], image foreground annotation [3], thumbnail creation [4], image quality assessment [5], and video summarization [6]. Recent decades have witnessed remarkable progress in saliency detection for nature scene images, especially from deep learning-based methods with highly competitive performance [7]. In this paper, a nature scene image (usually in RGB format) refers to one captured by hand-held cameras or cameras mounted to objects on the ground, where the objects are typically in an upright orientation. It is worth mentioning that salient object detection is different from ordinary object detection or anomaly detection. First, object detection is a generalized task that aims to detect all the objects, while salient object detection only focuses on discovering the salient objects, and anomaly detection is devoted to identifying the abnormal objects. Second, object detection and anomaly detection always use bounding boxes to delineate the objects, while salient object detection generates a pixel-level saliency probability map.

Similar to the related works [8, 9, 10, 11, 12], the RSIs used in this paper were collected from Google Earth with the spatial resolution ranging from 0.5m to 2m. Compared with the objects in nature scene images, the objects in RSIs (such as airports, buildings, and ships) usually have many different orientations, scales, and types since the RSIs are taken overhead [13, 14]. In addition, the optical RSI used in this paper is different from the hyperspectral image that includes more spectral bands information. The optical RSI used in this paper has the human eye friendly color presentation [8]. Facing the large-scale optical RSIs, people mainly focus on some local salient regions, e.g., man-made target and river system. Therefore, salient object detection in optical RSIs has extremely practical application values.

Optical RSI is usually photographed outdoors in a high angle shot via the satellite and aerial sensors, and thus, there may be diversely scaled objects, various scenes and object types, cluttered backgrounds, and shadow noises. Sometimes, there is even no salient region in a real outdoor scene, such as the desert, forest, and sea. Due to these unique imaging conditions and diverse scene patterns, it is difficult to achieve satisfactory performance by directly transplanting the nature scene image saliency detection methods into the optical RSI. Some visual examples of different saliency detection methods for optical RSIs are shown in Fig. 1, where RCRR [15] is an unsupervised salient object detection method for nature scene image, and R3Net [16] is a recent deep learning-based salient object detection method fine-tuned on our constructed optical RSI dataset for salient object detection. As visible, the unsupervised RCRR method thoroughly fails to detect the salient objects from the optical RSIs, and the deep learning-based R3Net method also cannot highlight the salient objects accurately and completely. In contrast, the results of the proposed LV-Net are closer to the GT. In addition, there is no publicly available optical RSI dataset with saliency annotation for performance evaluation of salient object detection methods in optical RSIs. Given all that, it is highly desirable to design a specialized method for salient object detection in optical RSIs as well as a comprehensive dataset for performance evaluation.

In this paper, we propose a novel convolutional neural network (CNN) architecture for salient object detection in optical RSIs, named LV-Net. The main contributions are summarized as follows.

• An end-to-end network for salient object detection in optical RSIs is proposed, including a two-stream pyramid module (L-shaped module) and an encoder-decoder module with nested connections (V-shaped module), which generalizes well to varying scenes and object patterns.

• The L-shaped module learns a set of complementary features to address the scale variability of salient objects and capture local details, and the V-shaped module automatically determines the discriminative features to suppress cluttered backgrounds and highlight salient objects.

• A challenging optical RSI dataset for salient object detection is constructed, including $800$ images with the corresponding pixel-wise ground truth. This dataset has been released for non-commercial use only. Moreover, the proposed method achieves the best performance against fourteen state-of-the-art salient object detection methods.

The rest of the paper is organized as follows. The related works on salient object detection are introduced in Section II. Section III presents the details of the proposed nested network with two-stream pyramid for salient object detection. In Section IV, the benchmark dataset and evaluation metrics, training strategies and implementation details, and experimental comparisons and analyses are discussed. Finally, the conclusion is drawn in Section V.

II Related Work

The past decade has witnessed significant advances and performance improvements in salient object detection. In this section, we briefly review bottom-up and top-down saliency models for nature scene image, and then discuss the salient object detection in optical RSI.

II-A Saliency Detection in Nature Scene Image

Bottom-up saliency detection model is stimulus-driven that aims at exploring low-level vision features. On one hand, some visual priors have been utilized to describe the properties of salient object based on the visual inspirations from the human visual system, such as contrast prior, background prior, and compactness prior. Cheng et al. [17] proposed a simple and efficient salient object detection method based on global contrast, in which the saliency is defined as the color contrast to all other regions in the image. Zhu et al. [18] proposed a boundary connectivity measure to evaluate the background probability of a region. Then, a principled optimization framework integrating multiple low-level cues is used to achieve saliency detection. Zhou et al. [19] combined the compactness prior with local contrast to discover salient object. Moreover, the saliency information is propagated on a graph through the diffusion framework. On the other hand, some traditional techniques have been introduced to achieve saliency detection, such as random walks, sparse representation, and matrix decomposition. Yuan et al. [15] proposed a regularized random walk ranking model, which introduces prior saliency estimation to every pixel by taking both region and pixel image features into consideration. Li et al. [20] used the reconstruction error to measure the saliency of a region, where the salient region corresponds to a larger reconstruction error. Peng et al. [21] proposed a structured matrix decomposition method guided by high-level priors with two structural regularizations to achieve saliency detection.

Top-down saliency detection models are task-driven and entail supervised learning with ground truth. In particular, deep learning has proven to be powerful for salient object detection. In recent years, numerous efforts have been made to design effective network architectures for extracting useful features that characterize the salient objects [16, 22, 23, 24, 25, 26, 27, 28]. Deng et al. [16] proposed a recurrent residual refinement network for saliency detection, where residual refinement blocks are leveraged to recurrently learn the difference between the coarse saliency map and the ground truth. Li and Yu [23] proposed a deep contrast network for saliency detection, where the multi-scale fully convolutional stream captures the visual contrast saliency, and the segment-wise spatial pooling stream simulates saliency discontinuities along object boundaries. Hou et al. [24] introduced short connections into the skip-layer structures within the Holistically-nested Edge Detector (HED) architecture to achieve image saliency detection, which combines the low-level and high-level features at multiple scales. Hu et al. [25] introduced the recurrently aggregated deep features (RADF) into an FCNN to achieve saliency detection by fully exploiting the complementary saliency information captured in different layers. Zhang et al. [26] utilized an encoder Fully Convolutional Network (FCN) and a corresponding decoder FCN to detect salient objects, in which a Reformulated dropout (R-dropout) is introduced to construct an uncertain ensemble of internal feature units.

II-B Saliency Detection in Optical RSI

Compared to the saliency detection for nature scene image, only a small amount of work focuses on saliency detection for optical RSI. It is worth pointing out that although some so-called saliency detection methods for optical RSI have been proposed, most of them aim to realize other optical RSI processing tasks, such as Region-of-Interest (ROI) extraction and generalized object detection, by employing existing simple saliency models. Zhao et al. [29] proposed a sparsity-guided saliency detection method for optical RSIs, where the sparse representation is used to obtain global and background cues for saliency map integration. The authors collected some optical RSIs with the corresponding pixel-level saliency masks. Unfortunately, they did not make it publicly available.

Dong et al. [9] employed the visual saliency detection to locate the ROIs and homogeneous backgrounds in the prescreening state, which significantly reduces the false alarms. Then, a rotation-invariant descriptor was proposed for ship detection in optical RSIs. Li et al. [30] calculated the saliency in order to assist building extraction by combining the region contrast, boundary connectivity, and background constraints. In [31], the texture saliency and color saliency were integrated into the pixel-level saliency map to extract the ROI. Li et al. [32] proposed a hierarchical ROI detection method for optical RSIs, where the multilevel color histogram contrast is used to compute the saliency and obtain the preliminary regions. In [33], a two-way saliency model that combines vision-oriented saliency and knowledge-oriented saliency was proposed to estimate the airport position.

The unique imaging conditions and diverse scene patterns pose new challenges for salient object detection in optical RSIs. Thus, it is difficult to achieve satisfactory performance by directly using the existing nature scene image saliency detection methods. Moreover, due to the lack of sufficient training data and elaborate network architecture to handle the multiple salient objects with diverse scales and complicated backgrounds, the superiority of deep learning has not been demonstrated in the salient object detection of optical RSIs. To address the above-mentioned problems, we propose a deep learning-based salient object detection method specially designed for optical RSIs, and construct the first publicly available optical RSI dataset for salient object detection.

III Proposed Method

III-A Framework

In Fig. 2, we present the LV-Net network architecture including a two-stream pyramid module (L-shaped module) and an encoder-decoder module with nested connections (V-shaped module), which takes an optical RSI as input and outputs its saliency map.

To address the different scales of salient objects in optical RSIs, an L-shaped module is designed in the LV-Net. First, we progressively down-sample the input optical RSI for input pyramid generation. Then, we extract the multi-scale feature representations of each down-sampled input through a multi-scale convolution unit, and finally form a multi-scale feature pyramid. The input pyramid preserves original detail features of input images, and the feature pyramid provides abstract semantic features. Both the detail and semantic features are significant for the task of salient object detection. Therefore, we concatenate multi-resolution input versions and multi-scale features at different levels to form the two-stream pyramid and obtain complementary features.

The L-shaped module can extract detail features and semantic features, but it is still not enough to accurately and completely detect the salient objects in optical RSIs. Thus, the complementary features hierarchically extracted by the two-stream pyramid structure are passed to an encoder-decoder module, which gradually integrates encoder detail features and decoder semantic features with nested connections. At the end, the salient regions of an input optical RSI are predicted by the integrated features in a deeply supervised manner. From the view of feature type, by combining the features from L-shaped and V-shaped modules, the final features are relatively more comprehensive than the features only from the L-shaped module or the features only from the V-shaped module. From the view of network optimization, the purpose of networks is to learn the discriminative feature representations to assist saliency detection and boost the final saliency performance. In Fig. 3, we provide the features visualization of the proposed network. As visible, the features progressively become discriminative (close to the final saliency map) which can effectively distinguish the foreground and background, such as the features in CU(0,3) and CU(1,3). In addition, one can find that the detail features (e.g., edges and textures) in the encoder path become more and more abstract with the down-sampling, while the cluttered and noisy backgrounds gradually vanish with the nested connections and up-sampling in the decoder path. Thus, from different views, the features extracted by the combination of the L-shaped and V-shaped modules are comprehensive and discriminative. Next, we will illustrate the advantages and implementation details of these two modules.

III-B Two-Stream Pyramid Module

As mentioned earlier, the type and scale of the objects in the optical RSI are variable and diverse, including some small scaled airplanes or large bodies of water. To deal with the scale variability of image patterns, we design an input pyramid structure and pass scaled versions through our network.

We employ the $2\times$ max pooling layer to progressively down-sample input image and generate the input pyramid:

$$I_{ds}^{k} = \mathrm{maxpool}(I_{in}),$$

where $k\in\{1,2,3,4\}$ indexes the down-sampling layer along the input image, $I_{ds}^{k}$ is the $k^{th}$ 2 $\times$ downsampling result of the input $I_{in}$ , and $maxpool$ represents the max pooling operation.
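As a minimal sketch of the input pyramid described above, the following pure-Python code applies $2\times$ max pooling repeatedly to a single-channel image stored as a nested list. The 2×2 window with stride 2, the chaining of each level from the previous one, and the helper names `max_pool_2x` and `input_pyramid` are assumptions made for illustration; the paper's implementation may differ.

```python
def max_pool_2x(image):
    """2x2 max pooling with stride 2 on a 2-D list (height and width must be even)."""
    height, width = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def input_pyramid(image, levels=4):
    """Return [I_ds^1, ..., I_ds^levels]; each level halves the previous resolution."""
    pyramid = []
    current = image
    for _ in range(levels):
        current = max_pool_2x(current)
        pyramid.append(current)
    return pyramid


# Toy 16x16 single-channel "image" whose pixel value encodes its position.
toy = [[r * 16 + c for c in range(16)] for r in range(16)]
levels = input_pyramid(toy)
sizes = [(len(level), len(level[0])) for level in levels]
```

For the toy image, `sizes` lists the four progressively halved resolutions, mirroring the $k\in\{1,2,3,4\}$ levels in the text.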

Then, we extract multi-scale features from the scaled versions by the M-CU to form the multi-scale feature pyramid. The number of output feature maps of each convolutional layer in the M-CU is denoted as $32\times 2^{k}$ . Here, $k$ indexes the down-sampling layer along the input image. The M-CU operation can be expressed as:

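$$F_{7\times 7}^{k} = \sigma(\mathbf{W}_{7\times 7}^{k} * I_{ds}^{k} + \mathbf{b}_{7\times 7}^{k}),$$

$$F_{5\times 5}^{k} = \sigma(\mathbf{W}_{5\times 5}^{k} * F_{7\times 7}^{k} + \mathbf{b}_{5\times 5}^{k}),$$

$$F_{3\times 3}^{k} = \sigma(\mathbf{W}_{3\times 3}^{k} * F_{5\times 5}^{k} + \mathbf{b}_{3\times 3}^{k}),$$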

where $F_{7\times 7}^{k}$ is a set of features extracted from $I_{ds}^{k}$ , $F_{5\times 5}^{k}$ is a set of features extracted from $F_{7\times 7}^{k}$ , $F_{3\times 3}^{k}$ is a set of features extracted from $F_{5\times 5}^{k}$ , $*$ represents a convolutional operation, $\mathbf{W}$ and $\mathbf{b}$ stand for the weight and bias, and $\sigma$ is the element-wise rectified linear unit (ReLU) activation function [34].
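The shape bookkeeping of the M-CU can be checked with a short sketch. It assumes the $128\times 128$ input and batch size of 16 listed in Table I, padding that preserves spatial size, and stride-1 convolutions without dilation; the function names `mcu_shapes` and `receptive_field` are hypothetical.

```python
def mcu_shapes(k, batch=16, input_size=128, in_channels=3):
    """Input and output shape (batch, width, height, channel) of the k-th M-CU."""
    size = input_size // 2 ** k      # k-th 2x down-sampling of the input
    out_channels = 32 * 2 ** k       # 32 x 2^k feature maps per convolutional layer
    return (batch, size, size, in_channels), (batch, size, size, out_channels)


def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated convolutions."""
    return 1 + sum(kernel - 1 for kernel in kernel_sizes)


shapes = {k: mcu_shapes(k) for k in (1, 2, 3, 4)}
mcu_rf = receptive_field([7, 5, 3])  # 7x7 -> 5x5 -> 3x3 cascade
```

Under these assumptions, the computed shapes agree with the M-CU rows of Table I, and the three cascaded convolutions cover a $13\times 13$ window of the down-sampled input.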

Finally, we concatenate multi-resolution input versions and multi-scale features at different levels to form the two-stream pyramid. The main reasons for using the two-stream pyramid lie in two aspects: (1) The input pyramid includes several scaled input images, which preserves original detail features of input images, but lacks semantic information. In contrast, the feature pyramid extracts multi-level features by consecutive convolution operations, which provides abstract semantic features, but lacks detail information. However, both detail features and semantic features are significant for the task of salient object detection. Therefore, we concatenate multi-resolution input versions and multi-scale features at different levels to form the two-stream pyramid to obtain complementary feature representations. Here, from the view of feature content, the concrete detail features and abstract semantic features are complementary. In Fig. 4, we present a visual example to explain the complementary behavior of the features extracted by the two-stream pyramid. For the LV-Net without the feature pyramid, the final saliency map may be incomplete since the semantic features extracted by the feature pyramid are removed. For the LV-Net without the input pyramid, the final saliency map may lose some details, such as the holes in the saliency map. In contrast, when the input pyramid and the feature pyramid are combined, the final saliency map preserves complete structures and clear details. (2) The two-stream pyramid can attain multiple level receptive fields by the consecutive convolution operations and the preservation of original detail features. Thus, we did not use the time-consuming dense connection pattern like DenseNet [35] to stack the features from all layers. We will demonstrate the effects of two-stream pyramid in Section IV.

III-C Encoder-Decoder Module with Nested Connections

Essentially, our architecture is a deeply supervised encoder-decoder network, where the encoder and decoder pathways are connected through a series of nested connections. The basic idea behind the nested connections is that they automatically select more discriminative saliency features through supervised learning, which facilitates the fusion of encoder-decoder features and mitigates the interference of cluttered and noisy backgrounds. In an encoder-decoder system, a model that only uses the high-level decoder features is unable to capture the details of the salient objects, while a model with only the low-level encoder features fails to distinguish the salient objects from the complicated backgrounds. To accurately capture the salient objects with exact boundaries, some encoder-decoder network architectures (e.g., U-Net [36]) concatenate encoder detail features and decoder semantic features through brute-force skip connections, whose effectiveness has been proven in saliency detection for simple nature scene images and in medical image segmentation. Unfortunately, we found that the brute-force skip connections can degrade the quality of saliency prediction because the cluttered and noisy encoder features are also passed to the prediction layer, especially for optical RSIs with complicated backgrounds. These ‘bad’ features seriously affect the accuracy of the saliency prediction. Therefore, we use the nested connections to gradually filter out the ‘bad’ distracting features and make salient objects stand out by task-driven learning. We will conduct an ablation study to verify our findings in Section IV.

Specifically, the operations in the encoder-decoder module with nested connections can be expressed as follows.

$$F_{(0,0)} = CU_{(0,0)}(I_{in}),$$

$$F_{(p,0)} = CU_{(p,0)}(\{F_{3\times 3}^{p}, I_{ds}^{p}, F_{(p-1,0)}^{down}\}),$$

where $F_{(0,0)}$ is a set of features extracted by convolution unit CU(0,0), $F_{(p,0)}$ is a set of features extracted by CU(p,0), $p\in\{1,2,3,4\}$ , $F_{(p-1,0)}$ is a set of features extracted by CU(p-1,0), $F_{(p-1,0)}^{down}$ is the $2\times$ down-sampling result of $F_{(p-1,0)}$ , $F_{3\times 3}^{p}$ is a set of outputs of the $p^{th}$ M-CU, $I_{ds}^{p}$ is the $p^{th}$ $2\times$ down-sampling result of the input optical RSI $I_{in}$ , and $\{F_{3\times 3}^{p},I_{ds}^{p},F_{(p-1,0)}^{down}\}$ represents the concatenation of $F_{3\times 3}^{p}$ , $I_{ds}^{p}$ and $F_{(p-1,0)}^{down}$ . Note that $\{\cdot\}$ indicates the feature concatenation operation (not the element-wise addition).

$$F_{(q,1)} = CU_{(q,1)}(\{F_{(q,0)}, F_{(q+1,0)}^{up}\}),$$

$$F_{(m,2)} = CU_{(m,2)}(\{F_{(m,1)}, F_{(m+1,1)}^{up}\}),$$

where $F_{(q,1)}$ is a set of features extracted by the convolution unit CU(q,1), and $q\in\{0,1,2,3\}$ . $F_{(m,2)}$ is a set of features extracted by the convolution unit CU(m,2), $m\in\{0,1,2\}$ . $F_{(q+1,0)}^{up}$ is the up-sampling result of $F_{(q+1,0)}$ , and $F_{(m+1,1)}^{up}$ is the up-sampling result of $F_{(m+1,1)}$ .

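$$F_{(0,3)} = CU_{(0,3)}(\{F_{(0,2)}, F_{(1,2)}^{up}\}),$$

$$F_{(1,3)} = CU_{(1,3)}(\{F_{(1,2)}, F_{(2,2)}^{up}, F_{(1,1)}\}),$$

where $F_{(0,3)}$ and $F_{(1,3)}$ are the features extracted by the convolution units CU(0,3) and CU(1,3), respectively.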

$$F_{(0,4)} = CU_{(0,4)}(\{F_{(0,3)}, F_{(1,3)}^{up}, F_{(0,1)}, F_{(0,2)}\}),$$

where $F_{(0,4)}$ is the output of CU(0,4) and is also the final saliency map. The number of output feature maps of each CU in the encoder-decoder module is $64\times 2^{i}$ , where $i$ indexes the down-sampling layer along the encoder pathway, except for CU(0,4), whose output has a single channel.
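To make the concatenation arithmetic of the nested connections concrete, the sketch below counts the input channels of several decoder units from the equations above. It assumes that up-sampling leaves the channel count unchanged, which the text does not state explicitly; the helper names are hypothetical.

```python
def cu_out_channels(i):
    """Feature maps produced by a CU at encoder depth i (64 x 2^i)."""
    return 64 * 2 ** i


def nested_in_channels(level):
    """Input channels of CU(level, j), j in {1, 2}.

    The input is the concatenation {F_(level, j-1), up(F_(level+1, j-1))},
    assuming up-sampling does not change the number of channels.
    """
    return cu_out_channels(level) + cu_out_channels(level + 1)


column_1 = [nested_in_channels(q) for q in range(4)]  # CU(0,1) .. CU(3,1)
column_2 = [nested_in_channels(m) for m in range(3)]  # CU(0,2) .. CU(2,2)

# CU(0,4) concatenates F_(0,3), up(F_(1,3)), F_(0,1) and F_(0,2).
final_in = (
    cu_out_channels(0) + cu_out_channels(1) + cu_out_channels(0) + cu_out_channels(0)
)
```

Under this assumption the counts reproduce the corresponding input sizes in Table I: 192, 384, 768, and 1536 channels for CU(0,1) to CU(3,1), and 320 channels for CU(0,4).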

III-D Loss Function

Following [24, 37], we also minimize the sigmoid cross-entropy loss to learn the saliency prediction mapping:

$$L = -(y\log(z) + (1-y)\log(1-z)),$$

where $0$ and $1$ represent the non-salient and salient region labels, and $y$ and $z$ are the true label and the score of the predicted result. However, this loss diverges ( $L\to\infty$ ) when the predicted score $z$ is exactly $0$ or $1$ , which can happen for an optical RSI that contains no salient object. Thus, we rewrite the sigmoid cross-entropy loss as:

$$L = -(y\log(F_{clip}(z)) + (1-y)\log(1-F_{clip}(z))).$$

where $F_{clip}$ is a function that returns a tensor of the same type and shape as its input with values clipped to the range $[\rho,\mu]$ . Specifically, any values less than $\rho$ are set to $\rho$ , while any values greater than $\mu$ are set to $\mu$ . Here, we set $\rho$ and $\mu$ to $10^{-15}$ and $1-10^{-15}$ , respectively.
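A scalar sketch of the clipped loss, written with the Python standard library only, shows how the clipping keeps the loss finite at $z=0$ and $z=1$. The function names `clip` and `clipped_cross_entropy` are hypothetical, and the per-pixel scalar form is a simplification of the tensor operation described above.

```python
import math

RHO = 1e-15        # lower clipping bound
MU = 1.0 - 1e-15   # upper clipping bound


def clip(z, low=RHO, high=MU):
    """Scalar stand-in for F_clip: limit z to the range [low, high]."""
    return min(max(z, low), high)


def clipped_cross_entropy(y, z):
    """Cross-entropy for one pixel with label y in {0, 1} and predicted score z."""
    z = clip(z)
    return -(y * math.log(z) + (1 - y) * math.log(1 - z))


worst_case = clipped_cross_entropy(1, 0.0)   # salient pixel scored as 0
other_worst = clipped_cross_entropy(0, 1.0)  # background pixel scored as 1
uncertain = clipped_cross_entropy(1, 0.5)
```

Without clipping, `math.log(0.0)` raises a `ValueError`. Python floats are double precision; in single precision $1-10^{-15}$ rounds to $1$, so a larger margin would be needed to bound the loss there.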

IV Experiments

In this section, we first introduce the constructed benchmark dataset for salient object detection in optical RSI, and then illustrate the evaluation metrics, training strategies, and implementation details. Then, we compare the proposed LV-Net with state-of-the-art methods to demonstrate its advantage. At last, some ablation studies are discussed to verify the role of each component and analyze the effects of parameter settings.

IV-A Benchmark Dataset

To the best of our knowledge, there is no publicly available optical RSI dataset for salient object detection. To fill this gap, we collected $800$ optical RSIs to construct a dataset for salient object detection, named the ORSSD dataset, and provide a manual pixel-wise annotation for each image. Most of the original optical RSIs were collected from Google Earth, and several are from existing RSI databases, such as the NWPU VHR-10 dataset [38], AID dataset [39], LEVIR dataset [40], and NWPU-RESISC45 dataset [41]. For ground truth annotation, we first selected 5 people with relevant backgrounds to mark the salient objects independently. Then, only the objects marked four times were selected as the ground truth salient objects. The ORSSD dataset is very challenging because a) the spatial resolution is diverse, such as $1264\times 987$, $800\times 600$, and $256\times 256$; b) the background is cluttered and complicated, including shadows, trees, and buildings; c) the types of salient objects are various, including airplane, ship, car, river, pond, bridge, stadium, beach, etc.; and d) the number and size of salient objects are variable, and some scenes, such as deserts and thick forests, even contain no salient object. Some sample images from the constructed ORSSD dataset are shown in Fig. 5. The ORSSD dataset can be used for deep network training and for the performance evaluation of salient object detection in optical RSIs, and is available from our project page https://li-chongyi.github.io/proj_optical_saliency.html.

IV-B Evaluation Metrics

To quantitatively evaluate the performance of different methods, the Precision-Recall (PR) curve, F-measure, MAE score, and S-measure are employed.

Using a series of fixed integer thresholds from $0$ to $255$, a saliency map can be thresholded into a set of binary saliency masks. Comparing each binary mask with the ground truth yields a pair of precision and recall scores. The PR curve is drawn from these precision-recall pairs, where the vertical axis denotes the precision score and the horizontal axis denotes the recall score. The closer the PR curve is to the coordinates $(1,1)$, the better the performance.
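The thresholding procedure can be sketched as follows. This is a minimal illustration, assuming an 8-bit saliency map and a boolean ground-truth mask; the function name `pr_curve`, the `>=` threshold convention, and the handling of empty masks are assumptions rather than details given in the paper.

```python
import numpy as np

def pr_curve(saliency, gt):
    """Precision and recall at every integer threshold 0..255.

    saliency: integer array with values in [0, 255].
    gt: boolean array of the same shape (True = salient).
    """
    saliency = np.asarray(saliency)
    gt = np.asarray(gt, dtype=bool)
    precisions, recalls = [], []
    for t in range(256):
        mask = saliency >= t                      # binary saliency mask at threshold t
        tp = np.logical_and(mask, gt).sum()       # correctly detected salient pixels
        precisions.append(tp / max(mask.sum(), 1))  # guard against an empty mask
        recalls.append(tp / max(gt.sum(), 1))       # guard against an empty ground truth
    return np.array(precisions), np.array(recalls)
```

Plotting `recalls` on the horizontal axis against `precisions` on the vertical axis gives the curve described above.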

As a comprehensive measurement, F-measure is defined as the weighted harmonic mean of precision and recall [42], i.e., $F_{\beta}=\frac{(1+\beta^{2})Precision\times Recall}{\beta^{2}\times Precision+Recall}$, where $\beta^{2}$ is set to $0.3$ to emphasize precision, as suggested in [43, 42]. A larger $F_{\beta}$ value indicates better overall performance.
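As a small worked example of the formula, the sketch below evaluates $F_{\beta}$ for a given precision and recall. The function name `f_measure` and the convention of returning zero when both scores are zero are assumptions made for illustration.

```python
def f_measure(precision, recall, beta2=0.3):
    """Weighted harmonic mean of precision and recall with beta^2 = 0.3 by default."""
    denom = beta2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention when precision and recall are both zero
    return (1 + beta2) * precision * recall / denom
```

For instance, a precision of 0.5 with a recall of 1.0 gives $0.65/1.15\approx 0.565$, which is lower than the plain harmonic mean ($\approx 0.667$) because the weighting penalizes the low precision more strongly.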

MAE score [44] calculates the average absolute difference between the continuous saliency map $S$ and ground truth $G$, i.e., $MAE=\frac{1}{w\times h}\sum_{i=1}^{w}\sum_{j=1}^{h}|S(i,j)-G(i,j)|$, where $w$ and $h$ represent the width and height of the image, respectively. The smaller the MAE score, the more similar the saliency map is to the ground truth, and the better the performance.
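The MAE definition translates directly into a few lines of NumPy. This sketch assumes that both the saliency map and the ground truth are already normalized to $[0,1]$; the function name `mae` is illustrative.

```python
import numpy as np

def mae(saliency, gt):
    """Mean absolute error between a saliency map and its ground truth.

    Both inputs are assumed to share the same shape and to lie in [0, 1].
    """
    s = np.asarray(saliency, dtype=np.float64)
    g = np.asarray(gt, dtype=np.float64)
    if s.shape != g.shape:
        raise ValueError("saliency map and ground truth must have the same shape")
    return float(np.mean(np.abs(s - g)))
```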

S-measure [45] evaluates the structural similarity between the saliency map and ground truth, i.e., $S_{m}=\alpha\times S_{o}+(1-\alpha)\times S_{r}$ , where $\alpha$ is set to $0.5$ for assigning equal contribution to both region $S_{r}$ and object $S_{o}$ similarity as suggested in [45]. The larger $S_{m}$ value demonstrates better performance in terms of the structural similarity.

IV-C Training Strategies and Implementation Details

Network Training. We randomly selected $600$ images from the ORSSD dataset for training and used the remaining $200$ images as the testing dataset. We augmented the data with flipping and rotation, obtaining seven additional augmented versions of the original training data. In the training phase, the samples were resized to $128\times 128$ due to our limited memory. In total, the augmented training data provided $4800$ pairs of images.
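One way to obtain seven additional versions from flipping and rotation is to combine the four right-angle rotations with a horizontal flip. The sketch below assumes this particular set of eight transforms, which the text does not specify; the function name `augment_eight` is hypothetical. The same transform would have to be applied to each image and its ground-truth mask.

```python
import numpy as np

def augment_eight(image):
    """Return the original image plus seven flipped/rotated variants.

    Assumed transform set: rotations by 0, 90, 180, and 270 degrees,
    each with and without a horizontal flip.
    """
    image = np.asarray(image)
    variants = []
    for k in range(4):
        rotated = np.rot90(image, k, axes=(0, 1))   # rotate in the image plane
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))   # flipped copy of this rotation
    return variants
```

Under this assumption, $600$ training images yield $600\times 8=4800$ image pairs, matching the count reported above.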

Implementation Details. We implemented the proposed LV-Net with TensorFlow on a PC with an Intel(R) i7-6700 CPU, $32$ GB RAM, and an NVIDIA GeForce GTX 1080 Ti GPU. During training, a batch-mode learning method with a batch size of $16$ was applied. The filter weights of each layer were initialized by the Xavier policy [46], and the biases were initialized to a constant. We used ADAM [47] for network optimization and fixed the learning rate to $10^{-4}$ for the entire training procedure. To illustrate the implementation details, we report the input size and output size of each convolutional unit during network training in Table I.

IV-D Comparison with State-of-the-art Methods

We compare the proposed method with fourteen state-of-the-art salient object detection methods on the testing subset of the ORSSD dataset, including seven unsupervised methods (DSR [20], RBD [18], RRWR [48], HDCT [49], DSG [50], MILPS [51], and RCRR [15]), four deep learning-based methods (DSS [24], RADF [25], R3Net [16], and RFCN [28]), and three saliency detection methods for optical RSIs (SSD [29], SPS [31], and ASD [33]). All the results are generated by the source codes or provided by the authors. For a fair comparison, we retrained the compared deep learning-based saliency detection methods using the same training data as the proposed LV-Net and the default parameter settings of the corresponding models. We tuned the compared methods to generate their best results. The visual comparisons are shown in Fig. 6, and the quantitative evaluations are reported in Fig. 7 and Table II.

In Fig. 6, six optical RSIs with different salient objects (e.g., ships, cars, airplanes, playground, and island) are illustrated. We can see that the unsupervised methods (i.e., DSR, DSG, and RCRR) cannot highlight the salient objects effectively and completely. For example, in the first scene, some background regions (e.g., the tree lawn) are wrongly detected as salient regions, and the four cars are not completely detected by these unsupervised methods. For salient objects with small sizes (e.g., the airplanes in the second and fourth images, and the ships in the fifth image), the unsupervised methods cannot handle this challenging situation effectively. In particular, the RCRR method [15] is completely unable to locate the salient objects in such challenging scenes. The existing salient object detection methods for optical RSIs (e.g., SSD [29], SPS [31], and ASD [33]) fail to locate the salient objects in the challenging optical RSIs because they only use or coarsely modify the hand-crafted features designed for nature scene images and ignore the unique characteristics of optical RSIs. By contrast, the deep learning-based methods achieve obviously superior performance. For example, in the first image, the complicated backgrounds are effectively suppressed by the deep learning-based methods, and most of the salient objects are clearly detected by the DSS method [24]. However, these methods have several drawbacks, including missing detection (e.g., the cars in the first image using the RADF method [25]), wrong detection (e.g., the ships in the fifth image using the DSS method [24] and the R3Net method [16]), and incomplete detection (e.g., the playground in the third image using the R3Net method [16], and the island in the last image using the DSS method [24]). Benefiting from both the two-stream pyramid and the nested connections, more comprehensive and discriminative representations can be learned to filter out the interference in the complicated backgrounds, refine the details of the salient objects, and improve the identification accuracy of salient objects. As can be seen, the proposed LV-Net highlights the salient objects more accurately and completely. Moreover, the backgrounds in our results are suppressed effectively.

The PR curves are shown in Fig. 7. The PR curve describes the different combinations of precision and recall scores, and the closer the PR curve is to the coordinates $(1,1)$, the better the performance. Compared with other methods, the proposed LV-Net achieves a higher recall score together with a higher precision score, and thus its PR curve lies above those of the other methods by a large margin. In other words, the proposed LV-Net achieves a win-win situation between precision and recall, which demonstrates the effectiveness of the proposed algorithm. Table II reports the quantitative measures of different methods, including the Precision score, Recall score, F-measure, MAE score, and S-measure. The best result for each evaluation is in bold. The same conclusion can be drawn from Table II, that is, the performance of the proposed LV-Net is significantly superior to the state-of-the-art methods in terms of Precision score, Recall score, F-measure, MAE score, and S-measure. Even compared with the second best method, the performance improvement of the proposed LV-Net is still obvious, i.e., the percentage gain reaches $3.4\%$ in terms of the Precision score, $3.8\%$ in terms of the Recall score, $4.9\%$ in terms of the F-measure, $29.4\%$ in terms of the MAE score, and $6.7\%$ in terms of the S-measure. It is worth mentioning that our method achieves significant performance improvements when compared with the existing saliency detection methods for optical RSIs. For example, compared with the ASD method [33], the percentage gain reaches $63.9\%$ in terms of F-measure, $90.2\%$ in terms of MAE score, and $61.0\%$ in terms of S-measure.

In Table III, we report the average running time (seconds per image) of different methods on the testing subset of the ORSSD dataset and the model size (MB) of the deep learning-based methods. The model size is a commonly used metric that indicates the complexity of deep learning-based methods. Since the results of the SSD, SPS, and ASD methods were directly provided by their authors, we are unable to calculate their running times. As shown, most of the deep learning-based methods are faster than the traditional methods during the testing phase. As shown in the second row of Table III, the proposed method has the second smallest model size. All these visual examples and quantitative measures demonstrate the effectiveness and efficiency of the proposed LV-Net.

IV-E Module Analysis

To demonstrate the improvements obtained by each component in our LV-Net, we conduct the following ablation studies (the architectures of the compared networks are provided in the supplementary material):

• LV-Net w/o Input-Pyramid: LV-Net without the input pyramid
• LV-Net w/o Feature-Pyramid: LV-Net without the feature pyramid
• LV-Net w/o L: LV-Net without the L-shaped module
• LV-Net w/o Nest: LV-Net without the nested connections
• LV-Net w/o Nest+: LV-Net without the nested connections, but with the skip connections at different levels from encoder pathway to decoder pathway
• V-Net: LV-Net without the L-shaped module and the nested connections, but with the skip connections at different levels from encoder pathway to decoder pathway
• V-Net-D: V-Net with the double number of feature maps

For a fair comparison, we use the same network parameters as the aforementioned settings, except for the V-Net-D. We double the number of feature maps for the V-Net to demonstrate that the good performance of the proposed LV-Net is not simply due to a larger network size. The quantitative results on the testing subset are reported in Table IV.

In Table IV, the proposed LV-Net outperforms the other variant networks in terms of all the evaluation metrics, which indicates the advantage of our network architecture. The LV-Net w/o Input-Pyramid and the LV-Net w/o Feature-Pyramid each perform worse than the proposed LV-Net, indicating that the complementary combination of detail features and semantic features is effective and boosts the performance of our network. Moreover, the input pyramid stream brings more performance gains than the feature pyramid stream because it preserves the original input information, which is significant for the densely connected network structure. Additionally, the performance of the LV-Net w/o Input-Pyramid and the LV-Net w/o Feature-Pyramid is slightly better than that of the LV-Net w/o L, which indicates that both the input pyramid and the feature pyramid have positive effects on the LV-Net. The LV-Net w/o Nest and the LV-Net w/o Nest+ rank in the last two places, which demonstrates the importance of the nested connections. Besides, the comparison between the LV-Net w/o Nest and the LV-Net w/o Nest+ also indicates that brute-force skip connections do not bring an obvious performance improvement in the task of salient object detection in optical RSIs. In fact, the final performance is directly related to the learned features. Compared with the LV-Net w/o L and the LV-Net w/o Nest, the LV-Net achieves better performance, indirectly demonstrating that it learns more comprehensive and discriminative feature representations than either of them. The V-Net achieves performance comparable to the V-Net-D, which indicates that simply enlarging the number of feature maps does not yield better performance.

In addition, some visual comparison results are shown in Fig. 8. In Fig. 8(a), we can see that the salient objects detected by the proposed LV-Net are clearer and sharper, for example at the boundaries and edges, than those detected by the LV-Net w/o L variant. In Fig. 8(b), the salient objects detected by the LV-Net w/o Nest variant have no clear and complete boundaries due to the lack of low-level detail features. In contrast, the result of the LV-Net w/o Nest+ has more complete boundaries; however, it introduces background noise through the brute-force skip connections. As can be seen from Fig. 8(c), the V-Net backbone fails to filter out the backgrounds effectively when compared with the proposed LV-Net, which further indicates the importance of the L-shaped module and the nested connections used in the proposed LV-Net.

In summary, the ablation studies demonstrate that a) the competitive performance of the proposed LV-Net benefits from the introduction of both the two-stream pyramid and the nested connections; b) the V-Net backbone is not suitable for salient object detection in optical RSIs with complicated backgrounds because of its use of brute-force skip connections; and c) a reasonable network design can achieve better performance than merely enlarging the network.

IV-F Parameter Analysis

We conduct experiments to analyze the effects of network parameter settings, including the numbers of filters and scales, and the kernel size. The quantitative comparison results on the testing subset are reported in Table V.

First, based on the default settings of the number of output features (i.e., $32\times 2^{k}$ for each convolutional layer in the M-CU, and $64\times 2^{i}$ for each convolutional layer in the encoder-decoder module), we reduce the number of output features to ( $16\times 2^{k}$ , $32\times 2^{i}$ ) and ( $8\times 2^{k}$ , $16\times 2^{i}$ ) for comparisons and denote them as LV-Net-16-32 and LV-Net-8-16, respectively. Here, $k$ and $i$ index the down-sampling layer along the input image and encoder pathway, respectively. Second, we analyze how the number of scales affects the performance. To be specific, we remove the branch of the smallest scale (i.e., reducing $5$ scales to $4$ scales). Then, we adjust the output of the CU*(0,3)* as the final saliency map and denote it as LV-Net-4Scales. Similarly, we reduce $5$ scales of the proposed LV-Net to $3$ scales (i.e., the output of CU*(0,2)* as the final saliency map) and denote it as LV-Net-3Scales. Third, we change the multi-scale convolution operations in the M-CU to single-scale convolution operation (i.e., $3\times 3$ , $3\times 3$ , and $3\times 3$ ) and denote it as LV-Net-S-CU.
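The filter-count variants can be tabulated with a short sketch. It assumes that both $k$ and $i$ run over $0,\dots,4$ for the five scales, which is inferred from the text rather than stated; the helper `feature_counts` is hypothetical.

```python
def feature_counts(base_mcu=32, base_encdec=64, levels=5):
    """Output feature counts per level for the M-CU and the encoder-decoder module.

    Returns (base_mcu * 2**k for each k, base_encdec * 2**i for each i).
    The index range 0..levels-1 is an assumption.
    """
    mcu = [base_mcu * 2 ** k for k in range(levels)]
    encdec = [base_encdec * 2 ** i for i in range(levels)]
    return mcu, encdec

# Default LV-Net and the two reduced variants described above.
settings = {
    "LV-Net": feature_counts(32, 64),
    "LV-Net-16-32": feature_counts(16, 32),
    "LV-Net-8-16": feature_counts(8, 16),
}
```

Each reduced variant halves the width of every layer relative to the previous setting, so the comparison isolates the effect of the number of output features.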

As reported in Table V, as the number of output features decreases, the values of the F-measure and S-measure decrease (and the values of the MAE increase). Such results agree with the common conclusion that more features can improve the performance of deep networks to some extent. In addition, we can see that as the number of scales decreases, the performance of the network decreases. The reason is that more scales contain more useful information, which is beneficial to learning more comprehensive feature representations for saliency detection. It also indicates that the multi-scale input and more nested connections bring gains. From the results of LV-Net-S-CU, we can see that the multi-scale convolutional layers in the M-CU help improve the performance of the proposed LV-Net.

V Conclusion

In this paper, we proposed the LV-Net for salient object detection in optical RSIs. Benefiting from both the two-stream pyramid module and the nested connections, the proposed LV-Net can accurately locate the salient objects with diverse scales and effectively suppress the cluttered backgrounds. Moreover, we constructed an optical RSI dataset for salient object detection with pixel-wise annotation. Experiments demonstrate the proposed method significantly outperforms the state-of-the-art methods both qualitatively and quantitatively. The module analysis and parameter discussion demonstrate the effectiveness of each designed component and the parameter settings in the proposed LV-Net.

To further improve the edge sharpness and spatial consistency of the salient regions, we will add a context module at the end of the proposed LV-Net to capture more contextual information. We will also extend the ORSSD dataset and periodically update the results of notable salient object detection methods on it. Moreover, extensions of saliency detection, such as co-saliency detection [52, 53, 54, 55], will be explored for optical RSIs in the future.

Bibliography

[1] W. Wang, J. Shen, R. Yang, and F. Porikli, “Saliency-aware video object segmentation,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 40, no. 1, pp. 20–33, Jan. 2018.
[2] Y. Fang, Z. Chen, W. Lin, and C.-W. Lin, “Saliency detection in the compressed domain for adaptive image retargeting,” IEEE Trans. Image Process., vol. 21, no. 9, pp. 3888–3901, Sep. 2012.
[3] X. Cao, C. Zhang, H. Fu, X. Guo, and Q. Tian, “Saliency-aware nonparametric foreground annotation based on weakly labeled data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 6, pp. 1253–1265, Jun. 2016.
[4] W. Wang, J. Shen, Y. Yu, and K.-L. Ma, “Stereoscopic thumbnail creation via efficient stereo saliency detection,” IEEE Trans. Vis. Comput. Graph, vol. 23, no. 8, pp. 2014–2027, Aug. 2017.
[5] K. Gu, S. Wang, H. Yang, W. Lin, G. Zhai, X. Yang, and W. Zhang, “Saliency-guided quality assessment of screen content images,” IEEE Trans. Multimedia, vol. 18, no. 6, pp. 1098–1110, Jun. 2016.
[6] H. Jacob, F. Padua, A. Lacerda, and A. Pereira, “Video summarization approach based on the emulation of bottom-up mechanisms of visual attention,” J. Intell. Information Syst., vol. 49, no. 2, pp. 193–211, Feb. 2017.
[7] R. Cong, J. Lei, H. Fu, M.-M. Cheng, W. Lin, and Q. Huang, “Review of visual saliency detection with comprehensive information,” IEEE Trans. Circuits Syst. Video Technol, vol. PP, no. 99, pp. 1–19, 2018.
[8] H. Lin, Z. Shi, and Z. Zou, “Fully convolutional network with task partitioning for inshore ship detection in optical remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 10, pp. 1665–1669, Oct. 2017.