S3: A Spectral-Spatial Structure Loss for Pan-Sharpening Networks
Jae-Seok Choi, Yongwoo Kim, Munchurl Kim

TL;DR
This paper introduces the S3 loss, a novel spectral-spatial structure loss function that improves the quality of pan-sharpened satellite images by reducing artifacts caused by misalignments.
Contribution
The paper proposes the S3 loss function, enhancing CNN-based pan-sharpening by effectively handling misalignments and improving visual quality of the output images.
Findings
Significant reduction of artifacts in pan-sharpened images
Improved visual quality across various CNN architectures
Effective handling of sensor misalignments
| Method / Metric | Avg. ERGAS1 | Avg. SCC1 | Avg. SCC0 | Avg. n-ERGAS1 |
|---|---|---|---|---|
| Bicubic | 0.6818 ± 0.0401 | 0.8577 ± 0.0089 | 0.4826 ± 0.0126 | 0.6818 ± 0.0401 |
| Provided PS | 3.6845 ± 0.2036 | 0.9647 ± 0.0014 | 0.9679 ± 0.0012 | 3.3506 ± 0.1849 |
| PanNet [16] | 0.4360 ± 0.0254 | 0.8468 ± 0.0094 | 0.7367 ± 0.0092 | 0.4360 ± 0.0254 |
| PanNet-S3 (Ours) | 3.0647 ± 0.1830 | 0.9530 ± 0.0015 | 0.9568 ± 0.0012 | 2.6465 ± 0.1533 |
| BDPN [20] | 1.3695 ± 0.0779 | 0.8836 ± 0.0069 | 0.9051 ± 0.0039 | 1.3691 ± 0.0779 |
| BDPN-S3 (Ours) | 3.3380 ± 0.2043 | 0.9563 ± 0.0015 | 0.9580 ± 0.0012 | 2.8221 ± 0.1681 |
| DSen2 [19] | 0.4278 ± 0.0258 | 0.8508 ± 0.0088 | 0.6485 ± 0.0142 | 0.4278 ± 0.0258 |
| DSen2-S3 (Ours) | 3.1800 ± 0.1929 | 0.9536 ± 0.0015 | 0.9539 ± 0.0013 | 2.6942 ± 0.1575 |
Jae-Seok Choi, Yongwoo Kim, and Munchurl Kim
This paper was submitted for review on Apr. 9, 2019 (Corresponding author: Munchurl Kim). This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (No. 2017R1A2A2A05001476). We thank Korea Aerospace Research Institute for providing the KOMPSAT-3A satellite dataset for our experiments.
J.-S. Choi and M. Kim are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea (e-mail: [email protected]; [email protected]). Y. Kim is with the Artificial Intelligence Research Division, Korea Aerospace Research Institute, Daejeon 34133, South Korea (e-mail: [email protected]).
Abstract
Recently, many deep-learning-based pan-sharpening methods have been proposed for generating high-quality pan-sharpened (PS) satellite images. These methods focused on various types of convolutional neural network (CNN) structures, which were trained by simply minimizing a spectral loss between network outputs and the corresponding high-resolution multi-spectral (MS) target images. However, due to different sensor characteristics and acquisition times, high-resolution panchromatic (PAN) and low-resolution MS image pairs tend to have large pixel misalignments, especially for moving objects in the images. Conventional CNNs trained with only the spectral loss with these satellite image datasets often produce PS images of low visual quality including double-edge artifacts along strong edges and ghosting artifacts on moving objects. In this letter, we propose a novel loss function, called a spectral-spatial structure (S3) loss, based on the correlation maps between MS targets and PAN inputs. Our proposed S3 loss can be very effectively utilized for pan-sharpening with various types of CNN structures, resulting in significant visual improvements on PS images with suppressed artifacts.
Index Terms:
Convolutional neural network (CNN), deep learning, pan sharpening, pan colorization, satellite imagery, spectral spatial structure, super resolution (SR).
I Introduction
Due to their sensor resolution constraints and bandwidth limitations, satellites often acquire multi-resolution multi-spectral images of the same target areas. In general, satellite images include pairs of a low-resolution (LR) multi-spectral (MS) image of longer ground sample distance (GSD), and a high-resolution (HR) panchromatic (PAN) image of shorter GSD. By extracting high-quality spatial structures from a PAN image and multi-spectral information from an MS image, one can generate a pan-sharpened (PS) image which has the same GSD as that of the PAN image but with the spectral information of the MS image. This is known as pan-sharpening or pan-colorization.
I-A Related Works
Traditional pan-sharpening methods [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] include component substitution [1, 2, 3, 4, 5], multiresolution analysis [6, 7] and machine-learning [8, 9, 10]. Comparisons for component substitution and multiresolution analysis based approaches were presented thoroughly in [11]. Component substitution based methods often incorporated the Brovey transform (BT) [1], the intensity-hue-saturation [2], principal component analysis (PCA) [3], or matting models [4] for pan-sharpening. In multiresolution analysis based methods, the spatial structures of PAN images are decomposed using wavelet [6] or undecimated wavelet [7] decomposition techniques, and are fused with up-sampled MS images to produce PS images. These methods have relatively low computation complexity but tend to produce PS images with mismatched spectral information. Machine-learning based methods [8, 9, 10] learn pan-sharpening models by optimizing a loss function of inputs and targets with some regularization terms.
With the advent of deep-learning, recent pan-sharpening methods [15, 16, 17, 18, 19, 20] started to incorporate various types of convolutional neural network (CNN) structures and are showing a large margin of quality improvements over traditional pan-sharpening methods. Most of these CNN-based pan-sharpening methods utilized network structures that were proven to be effective in classification [21, 22] and super-resolution (SR) [23, 24, 25] tasks. As the goal for pan-sharpening is to increase the resolution of MS inputs, many conventional CNN-based pan-sharpening methods employed network structures from the previous CNN-based SR methods [23, 24, 25]. Pan-sharpening CNN (PNN) [15] is known as the first method to employ CNN into pan-sharpening. The PNN used a shallow 3-layered network adopted from SRCNN [23] which is the first CNN-based SR method. The PNN was trained and tested on the Ikonos, GeoEye-1 and WorldView-2 satellite image datasets. Inspired by the success of ResNet [21] in classification, PanNet [16] incorporated the ResNet architecture with a smaller number of filter parameters to perform pan-sharpening. Lanaras et al. [19] employed the state-of-the-art SR network, EDSR [25], and proposed a moderately deep network version (DSen2) and a very deep network version (VDSen2) for pan-sharpening. Recently, a bidirectional pyramid network (BDPN) [20] has been proposed, using deep and shallow networks for PAN and MS inputs separately.
I-B Our Contributions
Since the state-of-the-art CNN-based pan-sharpening methods, PanNet [16], DSen2 [19] and BDPN [20], were trained using a simple spectral loss function that minimizes the reconstruction error between generated images and MS target images, their PS results often suffer from visually unpleasant artifacts along building edges and on moving cars, especially for datasets of shorter GSD such as the WorldView-3 dataset. This is because, as GSD becomes smaller, pixel misalignments between PAN and MS inputs tend to get larger due to the inevitable acquisition time difference and mosaicked sensor arrays. In such scenarios, the spectral loss between network outputs and MS target images is insufficient for training, thus resulting in PS images of low visual quality.
In this letter, we propose a novel loss term, called a spectral-spatial structure (S3) loss, which can be effectively utilized for training of pan-sharpening CNNs to learn spectral information of MS targets while preserving the spatial structure of PAN inputs. Our S3 loss consists of two loss functions: a spectral loss between network outputs and MS targets, and a spatial loss between network outputs and PAN inputs. Here, both spectral and spatial losses are computed based on the correlation maps between MS targets and PAN inputs. The spectral loss is selectively applied for the areas where averaged MS targets and PAN inputs are highly correlated. The spatial loss only considers gradient maps of generated images (network output) and PAN inputs. In doing so, our network using the S3 loss can generate PS images where double-edge artifacts and ghosting artifacts on moving cars are significantly reduced. Finally, we show that our S3 loss can effectively work with various pan-sharpening CNNs. Fig. 1 shows a CNN-based pan-sharpening architecture with our proposed S3 loss.
II Proposed Method
II-A Formulations
Most satellite imagery datasets include PAN images of higher resolution (smaller GSD) $P_0$, and the corresponding MS images of lower resolution (larger GSD) $M_1$. Here, the subscripts of $P_0$ and $M_1$ denote a level of resolution where a smaller number is for a higher resolution. We have two scenarios in terms of scales (resolutions): (i) Our final goal in pan-sharpening is to utilize both $P_0$ and $M_1$ inputs to generate a high-quality PS image $G_0$, which has the same resolution as $P_0$, while preserving the spectral information of $M_1$. This case corresponds to the original scale scenario in [16, 19]; (ii) Now we consider a pan-sharpening model that requires training using input and target pairs. For target images, we use $M_1$. For input images, we use $P_1$ and $M_2$, which are down-scaled versions of $P_0$ and $M_1$ respectively, using a degradation model [19]. The pan-sharpening CNN takes $P_1$ and $M_2$ as inputs, and generates $G_1$. This case corresponds to the lower scale scenario in [16, 19]. In conclusion, training and testing of the pan-sharpening networks are performed under the lower and original scale scenarios, respectively. In this regard, the conventional pan-sharpening networks were trained by simply minimizing a spectral loss between network outputs and MS targets under the lower scale scenario.
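The lower scale scenario can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the degradation model of [19] is replaced here by plain block averaging, and the helper names `box_downscale` and `make_training_pair`, the tile sizes and the scale factor of 4 are hypothetical choices, not taken from the letter.

```python
import numpy as np

def box_downscale(img, factor=4):
    """Average non-overlapping factor x factor blocks.

    Assumption: a stand-in for the degradation model of [19], not that model itself.
    """
    h, w = img.shape[:2]
    h, w = h - h % factor, w - w % factor            # crop to a multiple of factor
    img = img[:h, :w]
    blocks = img.reshape(h // factor, factor, w // factor, factor, *img.shape[2:])
    return blocks.mean(axis=(1, 3))

def make_training_pair(p0, m1, factor=4):
    """Build network inputs and target for the lower scale scenario."""
    p1 = box_downscale(p0, factor)    # PAN input, now at the resolution of the MS target
    m2 = box_downscale(m1, factor)    # MS input, one resolution level coarser
    return (m2, p1), m1               # the original MS image serves as the target

rng = np.random.default_rng(0)
p0 = rng.random((512, 512))           # hypothetical PAN tile
m1 = rng.random((128, 128, 3))        # hypothetical RGB MS tile
(m2, p1), target = make_training_pair(p0, m1)
print(m2.shape, p1.shape, target.shape)
```

At test time (original scale scenario) the same network would instead receive the undegraded MS and PAN images, for which no ground-truth target exists.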
II-B Proposed S3 Loss
We now define our spectral-spatial structure (S3) loss, which can be used for training any pan-sharpening CNN to yield high-quality PS images $G_1$, and ultimately $G_0$. First, we define our feedforward pan-sharpening operation as
$$G_1 = g(M_2, P_1; \theta), \tag{1}$$
where $g$ is a pan-sharpening CNN with filter parameters $\theta$. The conventional methods [16, 19] use the L2 loss as
$$L_2 = \| G_1 - M_1 \|_2^2. \tag{2}$$
However, solely using this loss function for training often leads to artifacts in the resultant PS images, due to inherent misalignments between the PAN and MS images. To overcome this limitation, we propose the S3 loss consisting of two loss functions: a spectral loss $L_c$ between the network output $G_1$ and the MS target $M_1$; and a spatial loss $L_a$ between $G_1$ and the PAN input $P_1$. First, the spectral loss $L_c$ penalizes spectral distortion more heavily in the areas where grayed $M_1$ (denoted as $\tilde{M}_1$) and $P_1$ are highly correlated. The correlation map $S$ can be formulated as
[TABLE]
where $K$ is a mean filter, $\circ$ denotes an element-wise multiplication, $\gamma$ is a control parameter, and $\epsilon$ is a very small value. We use a 31×31 box filter for $K$. We empirically set $\gamma$ to 4. Using $S$, our spectral loss $L_c$ is then defined as
[TABLE]
Here, we try to minimize the spectral loss between $G_1$ and $M_1$ only for pixel areas where $\tilde{M}_1$ and $P_1$ have large positive or negative correlations. Note that $S$ is not trainable.
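As a rough illustration of the idea, the sketch below builds a correlation map from box-filtered local statistics and uses it to weight a per-pixel spectral error. Everything specific in it is an assumption rather than the letter's exact formulation: the local Pearson form of the correlation, the absolute value raised to `gamma`, the edge padding, the value `eps=1e-10`, the L1 form of the weighted error, and the function names `box_mean`, `correlation_map` and `spectral_loss`. Only the 31×31 box filter and the control parameter value of 4 come from the text.

```python
import numpy as np

def box_mean(x, size=31):
    """Mean filter with a size x size box (edge padding), computed via an integral image."""
    pad = size // 2
    xp = np.pad(x, pad, mode="edge")
    ii = np.pad(xp.cumsum(0).cumsum(1), ((1, 0), (1, 0)))   # zero row/column in front
    h, w = x.shape
    total = (ii[size:size + h, size:size + w] - ii[:h, size:size + w]
             - ii[size:size + h, :w] + ii[:h, :w])
    return total / (size * size)

def correlation_map(m_gray, pan, size=31, gamma=4.0, eps=1e-10):
    """Assumed form: |local Pearson correlation| ** gamma, values in [0, 1]."""
    mu_m, mu_p = box_mean(m_gray, size), box_mean(pan, size)
    cov = box_mean(m_gray * pan, size) - mu_m * mu_p
    var_m = np.abs(box_mean(m_gray * m_gray, size) - mu_m ** 2)
    var_p = np.abs(box_mean(pan * pan, size) - mu_p ** 2)
    corr = cov / (np.sqrt(var_m * var_p) + eps)
    return np.clip(np.abs(corr), 0.0, 1.0) ** gamma

def spectral_loss(g, m, s):
    """Mean absolute spectral error, weighted per pixel by the fixed map s."""
    return float(np.mean(np.abs(g - m) * s[..., None]))

rng = np.random.default_rng(0)
pan = rng.random((128, 128))
ms = np.stack([pan, pan, pan], axis=-1)          # toy MS target perfectly aligned with PAN
s = correlation_map(ms.mean(axis=-1), pan)       # "grayed" MS target versus PAN
print(float(s.min()), spectral_loss(ms * 0.9, ms, s))
```

Because the map is computed from the targets and inputs only, it can be precomputed per training sample and treated as a constant weight during back-propagation, which matches the remark that it is not trainable.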
For our spatial loss $L_a$, we try to minimize the difference between the gradient map of grayed $G_1$ (denoted as $\tilde{G}_1$) and that of $P_1$, which is formulated as
[TABLE]
where $\mathrm{grad}(X)$ for an image $X$ is a function defined as
[TABLE]
We incorporated $S$ into $L_a$ in (9), so that $L_a$ focuses more on those areas where $L_c$ is less focused. Finally, combining $L_c$ and $L_a$, we have our final S3 loss as
[TABLE]
where $\lambda$ is a weighting value. We empirically set $\lambda$ to 1.
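The interplay of the two terms can be sketched as follows, taking a precomputed correlation map `s` with values in [0, 1] as given. The concrete choices are assumptions made for illustration only: forward finite differences as the gradient map, the complementary weight `2 - s` for the spatial term, L1 errors, the additive combination with the weight `lam` on the spatial term, and the names `gradient_map`, `spatial_loss` and `s3_loss`.

```python
import numpy as np

def gradient_map(x):
    """Forward differences along rows and columns, stacked on a last axis (assumed operator)."""
    gy = np.diff(x, axis=0, append=x[-1:, :])
    gx = np.diff(x, axis=1, append=x[:, -1:])
    return np.stack([gy, gx], axis=-1)

def spatial_loss(g, pan, s):
    """Gradient mismatch between the grayed output and PAN, heavier where s is small."""
    g_gray = g.mean(axis=-1)                              # "grayed" network output
    diff = np.abs(gradient_map(g_gray) - gradient_map(pan))
    return float(np.mean(diff * (2.0 - s)[..., None]))    # assumed complementary weight

def s3_loss(g, m, pan, s, lam=1.0):
    """Assumed combination: weighted spectral term plus lam times the spatial term."""
    spectral = float(np.mean(np.abs(g - m) * s[..., None]))
    return spectral + lam * spatial_loss(g, pan, s)

rng = np.random.default_rng(0)
pan = rng.random((16, 16))
target = rng.random((16, 16, 3))
output = rng.random((16, 16, 3))
s = rng.random((16, 16))                                  # hypothetical correlation map
print(s3_loss(output, target, pan, s))
```

With this weighting, pixels where the MS target and PAN disagree (small `s`) contribute little to the spectral term and more to the spatial term, so the output there is steered toward the PAN structure instead of a misaligned target.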
In order to show the effectiveness of our S3 loss, we incorporated it into the state-of-the-art pan-sharpening networks, PanNet [16], BDPN [20] and DSen2 [19], which are named PanNet-S3, BDPN-S3 and DSen2-S3 in our experiments. The DSen2 network has 14 convolutional layers with 128 channels, having about 1.8M filter parameters. PanNet has 10 layers with 76K parameters, while BDPN has 46 layers with 1.4M parameters. As for our PanNet-S3 and BDPN-S3, full data of MS-PAN inputs were concatenated and used as the network input.
III Experiment Results and Discussions
III-A Experiment Settings
III-A1 Datasets
All the networks including ours and baselines were trained and tested on the WorldView-3 satellite image dataset, whose PAN images are of about 0.3 m GSD and MS images are of about 1.2 m GSD. PS images of 0.3 m GSD are also provided in the dataset, but they are used only for visual comparison with our results. Note that the WorldView-3 satellite image dataset has the shortest GSD (highest resolution) among the aforementioned datasets. We selected and used the WorldView-3 satellite image dataset from the SpaceNet Challenge dataset [26]. The RGB channels of the MS images were used for all experiments. A total of 13K MS-PAN image pairs were used for training the networks, where cropping and various data augmentations were conducted on the fly during training. The MS-PAN training subimages were created by applying the down-scaling method in [19]. The cropped MS subimages used for training are of 32×32 size, while the PAN subimages are of 128×128 size. Before being fed into the networks, the training image pairs were normalized to have a range between 0 and 1. Training was done in the lower scale scenario.
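The on-the-fly cropping and normalization described above might look like the following sketch. The helper names `random_aligned_crop` and `normalize`, the image sizes, and the divisor `2047.0` (an 11-bit maximum) are hypothetical; the letter does not state how the normalization to [0, 1] was carried out.

```python
import numpy as np

def random_aligned_crop(ms, pan, rng, ms_size=32, scale=4):
    """Crop an MS patch and the PAN patch that covers the same ground area."""
    top = int(rng.integers(0, ms.shape[0] - ms_size + 1))
    left = int(rng.integers(0, ms.shape[1] - ms_size + 1))
    ms_crop = ms[top:top + ms_size, left:left + ms_size]
    pan_crop = pan[scale * top:scale * (top + ms_size),
                   scale * left:scale * (left + ms_size)]
    return ms_crop, pan_crop

def normalize(x, max_value):
    """Map raw digital numbers to [0, 1]; max_value is an assumed sensor maximum."""
    return np.clip(x.astype(np.float64) / max_value, 0.0, 1.0)

rng = np.random.default_rng(0)
ms_raw = rng.integers(0, 2048, size=(100, 100, 3))     # hypothetical raw MS image
pan_raw = rng.integers(0, 2048, size=(400, 400))       # hypothetical raw PAN image
ms_crop, pan_crop = random_aligned_crop(ms_raw, pan_raw, rng)
ms_in, pan_in = normalize(ms_crop, 2047.0), normalize(pan_crop, 2047.0)
print(ms_in.shape, pan_in.shape)
```

Scaling the crop coordinates by the resolution ratio keeps the two patches registered on the pixel grid, although it cannot remove the sensor-level misalignments that motivate the S3 loss.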
III-A2 Training
We trained all the networks using the decoupled ADAMW optimization [27] with an initial learning rate of , initial weight decay of , and the other hyper-parameters as defaults. The mini-batch size was set to 2. We employed a uniform weight initialization technique in [28]. All the networks including our proposed networks were implemented using TensorFlow [29], and were trained and tested on Nvidia Titan Xp GPU. The networks were trained for total iterations, where the learning rate and weight decay were lowered by a factor of 10 after iterations. In our PanNet-S3, initial learning rate and weight decay were set to and , respectively. In our BDPN-S3, we used for the hyper-parameter in our S3 loss, and in the S3 loss was empirically set to 2.
III-B Comparisons and Discussions
We now compare our proposed methods using the S3 loss, with the conventional pan-sharpening methods including bicubic, PS images provided from the WorldView-3 dataset, PanNet [16], BDPN [20] and DSen2 [19]. We implemented PanNet, BDPN and DSen2 according to their technical descriptions, and trained them on the WorldView-3 dataset. At testing, for MS input images with a size of 160×160, average computation time for our DSen2-S3 on GPU is about 2 sec per image.
As in [11, 16], we use two popular metrics: Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [30] for measuring spectral distortion, and spatial correlation coefficient (SCC) [31] for measuring spatial distortion. Lower is better for ERGAS, whereas higher is better for SCC. Note that in the original scale scenario, there are no ground truth PS images for comparison. Therefore, in this letter, given a PS output at the original scale, SCC0 between the network output and PAN input, ERGAS1 between a down-scaled network output and its MS input, and SCC1 between the down-scaled network output and down-scaled PAN input were computed.
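For readers unfamiliar with the two metrics, a minimal NumPy sketch follows. The ERGAS function uses the commonly cited definition, 100 times the resolution ratio times the root of the band-averaged squared relative RMSE. The SCC function assumes a 3×3 Laplacian high-pass filter and averages the per-band correlation with PAN; that filter choice, the band averaging and the function names are assumptions and may differ from the evaluation code behind Table I.

```python
import numpy as np

def ergas(output, reference, ratio=0.25):
    """ERGAS between two (H, W, C) images; ratio = PAN GSD / MS GSD. Lower is better."""
    rmse = np.sqrt(np.mean((output - reference) ** 2, axis=(0, 1)))
    band_mean = np.mean(reference, axis=(0, 1))
    return float(100.0 * ratio * np.sqrt(np.mean((rmse / band_mean) ** 2)))

def laplacian(x):
    """3x3 Laplacian high-pass response on the interior pixels of a 2-D image."""
    return (x[:-2, 1:-1] + x[2:, 1:-1] + x[1:-1, :-2] + x[1:-1, 2:]
            - 4.0 * x[1:-1, 1:-1])

def scc(output, pan):
    """Mean over bands of the correlation between high-pass details of a band and of PAN."""
    hp_pan = laplacian(pan).ravel()
    scores = [np.corrcoef(laplacian(output[..., k]).ravel(), hp_pan)[0, 1]
              for k in range(output.shape[-1])]
    return float(np.mean(scores))                       # higher is better

rng = np.random.default_rng(0)
pan = rng.random((64, 64))
ps = np.stack([pan, pan, pan], axis=-1)                 # toy output sharing PAN structure
ms = rng.random((64, 64, 3)) + 0.5                      # toy reference with nonzero band means
print(ergas(ps, ms), scc(ps, pan))
```

ERGAS only sees per-pixel spectral differences, so a spatial shift between the compared images inflates it even when colors are reproduced faithfully, whereas SCC only sees high-frequency structure.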
Table I shows average quality metric scores (with standard errors), ERGAS1, SCC1 and SCC0 for PS results at the PAN resolution (0.3 m GSD). Here, 100 MS-PAN pairs from the WorldView-3 satellite test dataset were selected for testing in the original scale scenario. As shown in Table I, the PS results by PanNet, BDPN and DSen2 have lower ERGAS values, showing lower spectral distortion, but lower SCC values, indicating higher spatial distortion. On the other hand, our methods with the S3 loss generated PS images with much higher SCC values, but with slightly higher spectral distortion. Note that, since ERGAS simply computes the score values of spectral distortion between MS and PAN test image pairs that are often misaligned with unknown magnitudes and directions, it may not be effective in measuring the distortions for misaligned MS-PAN pairs.
Here, we additionally propose a more effective spectral distortion metric, called n-ERGAS, which is a simple variant of ERGAS inspired by an evaluation method used in the NTIRE 2018 Super-Resolution Challenge [32] for misaligned input-target pairs. In this challenge, input images were randomly translated from the corresponding target images, and a new metric was used for evaluation. As for our n-ERGAS, once we obtain a PS result image, 6-pixel translations are applied to obtain 144 translated PS images. Next, multiple ERGAS scores are computed using down-scaled versions of these translated images and an MS input, and the most favorable (smallest) ERGAS score is selected as the final ERGAS score for evaluation. As methods using our proposed S3 loss can reconstruct spectral information of misaligned MS on spatially correlated areas with PAN, their n-ERGAS scores should be lower (thus better) than the corresponding ERGAS scores. The n-ERGAS1 scores for the various methods are presented in Table I. As shown, all our methods (PanNet-S3, BDPN-S3 and DSen2-S3) have lower n-ERGAS scores compared to the corresponding ERGAS scores, while almost no difference is observed between n-ERGAS and ERGAS scores for the baselines (PanNet [16], BDPN [20] and DSen2 [19]). This indicates that our S3 loss indeed tries to minimize the spectral distortion more on spatially correlated areas with PAN, demonstrating the effectiveness of using our S3 loss for misaligned MS-PAN images.
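The search over translations can be sketched as below. Several details are assumptions for illustration: offsets from -6 to 5 pixels per axis (12 × 12 = 144 candidates, one reading of the description above), circular shifts via `np.roll`, block averaging as the down-scaling step, and the function names. The block is self-contained, so it repeats a compact ERGAS definition.

```python
import numpy as np

def ergas(output, reference, ratio=0.25):
    """Commonly cited ERGAS definition for (H, W, C) images."""
    rmse = np.sqrt(np.mean((output - reference) ** 2, axis=(0, 1)))
    return float(100.0 * ratio * np.sqrt(np.mean((rmse / reference.mean(axis=(0, 1))) ** 2)))

def box_downscale(img, factor=4):
    """Block averaging; assumes H and W are multiples of factor."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def n_ergas(ps, ms, factor=4, max_shift=6):
    """Smallest ERGAS over a grid of translations of the PS image."""
    best = np.inf
    for dy in range(-max_shift, max_shift):             # 12 offsets per axis, 144 in total
        for dx in range(-max_shift, max_shift):
            shifted = np.roll(ps, (dy, dx), axis=(0, 1))  # circular shift, borders wrap
            score = ergas(box_downscale(shifted, factor), ms, 1.0 / factor)
            best = min(best, score)
    return best

rng = np.random.default_rng(0)
ms = rng.random((16, 16, 3)) + 0.5                      # toy MS input
ps = np.repeat(np.repeat(ms, 4, axis=0), 4, axis=1)     # toy PS image consistent with ms
ps_shifted = np.roll(ps, (3, -2), axis=(0, 1))          # simulate a misalignment
print(ergas(box_downscale(ps_shifted), ms), n_ergas(ps_shifted, ms))
```

In this toy case the plain ERGAS is dominated by the simulated shift, while the minimum over translations recovers the alignment, which is the behaviour the n-ERGAS comparison in Table I relies on.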
We now visually compare several pan-sharpening methods including ours. Fig. 2 shows PS images for various methods on the WorldView-3 (WV3) dataset. First, PS images provided from the dataset show high spectral distortion, with blue glow around cars. Because they were trained using a simple loss between network outputs and MS targets, PanNet, BDPN and DSen2 tend to perform poorly on misaligned MS-PAN test inputs, creating unpleasant artifacts around strong edges and moving objects in the PS images. On the other hand, our method using the proposed S3 loss can reconstruct PS images with highly sharpened edges, rooftops, roads and cars with far fewer artifacts, visually outperforming the conventional methods. However, some spectral artifacts are slightly visible around cars, indicating that there is still room for improvement. Nevertheless, the results using conventional PanNet, BDPN and DSen2 methods still suffer from ghosting and double-edge artifacts, degrading the overall visual quality. This confirms that our proposed S3 loss can be used for various networks to generate PS images with higher visual quality and fewer artifacts, compared to their baselines.
Moreover, we conducted experiments using two additional satellite datasets: the WorldView-2 (WV2) dataset and the KOMPSAT-3A (K3A) dataset. The WV2 dataset is of 11 bits per pixel, and includes PAN images of 0.5 m GSD and MS images of 2.0 m GSD. The K3A dataset is of 14 bits per pixel, and includes PAN images of 0.7 m GSD and MS images of 2.8 m GSD. Figs. 3 and 4 show pan-sharpening results at the original scale using various methods on the WV2 dataset and the K3A dataset, respectively. As shown, similar to the experiment results using the WV3 dataset, PS images using our DSen2-S3 method trained with WV2 have a slightly higher spectral distortion compared to MS inputs (higher ERGAS), but their spatial details are much more similar to those of PAN inputs (higher SCC). This implies that our S3 loss is effective and robust for different types of satellite datasets.
We now present experiment results to show the effectiveness of using the correlation map in our S3 loss. Here, we set , so that the correlation map was not used in training. Fig. 5 shows pan-sharpening results at the original scale on the WorldView-3 test dataset for our DSen2-S3 without using the correlation maps . As shown, simply adding the spatial loss regarding PAN inputs would not be able to overcome artifacts, and much more spectral distortions are visible around moving cars if we do not incorporate the correlation map into our S3 loss. Therefore, we can confirm that the correlation map plays an important role in our S3 loss.
IV Conclusion
We proposed a novel spectral-spatial structure (S3) loss that can be effectively applied to CNN-based pan-sharpening methods. Our S3 loss jointly measures spectral, spatial and structural distortions, so that CNN-based pan-sharpening networks can be effectively trained to generate highly detailed PS images with fewer artifacts than those trained with conventional losses based simply on the difference between network outputs and MS targets.
References
- [1] A. R. Gillespie, A. B. Kahle, and R. E. Walker, “Color enhancement of highly correlated images. i. decorrelation and HSI contrast stretches,” Remote Sensing of Environment, vol. 20, no. 3, pp. 209–235, Dec. 1986.
- [2] W. J. Carper, T. M. Lillesand, and P. W. Kiefer, “The use of intensity-hue-saturation transformations for merging spot panchromatic and multispectral image data,” Photogrammetric Engineering and Remote Sensing, vol. 56, Jan. 1990.
- [3] V. P. Shah, N. H. Younan, and R. L. King, “An efficient pan-sharpening method via a combined adaptive PCA approach and contourlets,” IEEE Transactions on Geoscience and Remote Sensing, vol. 46, no. 5, pp. 1323–1335, May 2008.
- [4] X. Kang, S. Li, and J. A. Benediktsson, “Pansharpening with matting model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 8, pp. 5088–5099, Aug. 2014.
- [5] Q. Xu, B. Li, Y. Zhang, and L. Ding, “High-fidelity component substitution pansharpening by the fitting of substitution data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 11, pp. 7380–7392, Nov. 2014.
- [6] S. G. Mallat, “A theory for multiresolution signal decomposition: the wavelet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 7, pp. 674–693, 1989.
- [7] J.-L. Starck, J. Fadili, and F. Murtagh, “The undecimated wavelet decomposition and its reconstruction,” IEEE Transactions on Image Processing, vol. 16, no. 2, pp. 297–309, 2007.
- [8] Z. Pan, J. Yu, H. Huang, S. Hu, A. Zhang, H. Ma, and W. Sun, “Super-resolution based on compressive sensing and structural self-similarity for remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 9, pp. 4864–4876, Sep. 2013.
