Late or Earlier Information Fusion from Depth and Spectral Data? Large-Scale Digital Surface Model Refinement by Hybrid-cGAN
Ksenia Bittner, Marco Körner, Peter Reinartz

TL;DR
This paper introduces a Hybrid-cGAN approach for refining digital surface models by fusing spectral and height data early in the process, leading to improved detail, accuracy, and boundary rectilinearity in 3D building representations.
Contribution
The paper proposes a novel Hybrid-cGAN architecture that fuses spectral and height information early, enhancing DSM refinement and 3D building detail reconstruction.
Findings
Early data fusion improves fine detail propagation.
Enhanced boundary rectilinearity in 3D models.
Effective refinement of DSMs with spectral and height data.
Abstract
We present the workflow of a digital surface model (DSM) refinement methodology using a Hybrid-cGAN whose generative part consists of two encoders and a common decoder, which blends the spectral and height information within one network. The inputs to the Hybrid-cGAN are single-channel photogrammetric DSMs with continuous values and single-channel pan-chromatic (PAN) half-meter resolution satellite images. Experimental results demonstrate that the earlier fusion of information from data with different physical meanings helps to propagate fine details and to complete inaccurate or missing 3D information about building forms. Moreover, it improves the building boundaries, making them more rectilinear.
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Remote Sensing and LiDAR Applications · Advanced Vision and Imaging
Late or Earlier Information Fusion from Depth and Spectral Data? Large-Scale Digital Surface Model Refinement by Hybrid-cGAN
Ksenia Bittner, Peter Reinartz
German Aerospace Center (DLR)
Munich, Germany
Marco Körner
Technical University of Munich (TUM)
Munich, Germany
1 Introduction
Digital surface models (DSMs) with detailed and realistic building shapes are beneficial data sources for many remote sensing applications, such as urban planning, cartographic analysis, and environmental investigations. Several ways exist for deriving 3D elevation morphology, but DSMs generated from space-borne data using stereo image matching techniques nowadays have lower costs than other technologies and show relatively high spatial resolution and wide coverage, which is important for large-scale remote sensing applications such as disaster monitoring.
However, despite these advantages, DSMs generated by image-based matching techniques show a considerable amount of noise and outliers because of matching errors or occlusions due to densely located buildings or trees that cover parts of the buildings. To solve these problems, algorithms from computer vision have been analyzed and adapted to satellite imagery processing. In the literature, very few of the proposed approaches work towards photogrammetric DSM improvement of urban areas. Earlier methodologies investigate DSM refinement by applying filtering techniques, such as Gaussian [1], Kalman [2], or geostatistical [3] filters, or interpolation routines, including inverse distance weighting (IDW) and kriging [4]. Although they achieve smoother roof planes, they negatively influence the steepness of the building walls. Later methodologies proposed to additionally utilize spectral images for DSM refinement tasks, as these contain accurate information about object boundaries and texture. For instance, [5] transfer segmentation information from stereo satellite imagery to the DSM and, from statistical analysis and spectral information, perform object detection and classification. As a further step, this information is used to refine the DSM. Sirmacek et al. [6] propose building shape refinement on DSMs by a multiple-step procedure. First, they extract possible building segments by thresholding the normalized digital surface model (nDSM). Then, by considering Canny edge information from the spectral image, they fit rectangular boxes to detect building shapes. Finally, using the detected rectangular boundaries, the building shapes are further enhanced. The drawback of this methodology is that it assigns one single height value to each generated building object. Moreover, it is limited to the detection and enhancement of rectangular buildings only.
In recent years, convolutional neural networks (CNNs), an already well-studied branch of the deep learning family, have also been applied to the remote sensing field. They achieve state-of-the-art results for image classification, object detection, and semantic segmentation tasks. However, most of these methodologies extract information from spectral imagery, while depth image processing is still not well explored, as it is not straightforward to work with continuous values. For example, [7] employ a cascaded deep network which first performs a coarse global prediction from a single spectral image and afterwards refines the predictions locally. Liu et al. [8] join CNNs and conditional random fields (CRFs) in a unified framework while making use of superpixels for preserving sharp edges. For height prediction problems, only a couple of works have attempted to improve DSMs. Our earlier approaches [9, 10] propose to refine building shapes to a high level of detail from photogrammetric half-meter resolution satellite DSMs using generative adversarial network (GAN)-based techniques. Mainly, they investigate conditional generative adversarial networks (cGANs) with two objective functions (i.e. the negative log-likelihood and least square residuals) to generate accurate light detection and ranging (LIDAR)-like and LoD 2-like DSMs with enhanced 3D building shapes directly from noisy input data. Moreover, an experiment performed on a completely new dataset, belonging to different geographical areas, showed the network's potential to generalize well to different cities with complex constructions. As it is common in the field of remote sensing to fuse data with different modalities to complement missing knowledge, in our follow-up work [11] we introduce a cGAN-based network which merges depth and spectral information within an end-to-end framework. The data from the two separate networks, which are fed with PAN images and DSMs, are fused at a late stage, right before producing the final output.
Following up on the aforementioned approach, in this work we investigate the fusion of spectral (Fig. 1(a)) and height (Fig. 1(b)) information at an earlier stage within an end-to-end Hybrid-cGAN network, to further improve not only small and simple residential buildings but also complex industrial ones. In addition, we add an auxiliary normal vector loss term to the objective function, which encourages the model to produce more planar and flat roof surfaces, similar to the desired LoD 2-DSM (Fig. 1(c)) artificially produced from city geography markup language (CityGML) data.
2 Methodology
2.1 Objective function
The advent of GAN-based domain adaptation neural networks, introduced by [12], led to great performance gains in generating realistic, but entirely artificial images. These GANs realize an adversarial manner of learning by training a pair of networks in a competing way: A generator $G$ produces a fake image $G(z)$ from a latent vector $z \sim p_z(z)$ drawn from any distribution $p_z$ (e.g. a uniform distribution), which looks like a real one $y \sim p_{\mathrm{real}}(y)$. Adversarially, a discriminator $D$ tries to decide whether a presented image is a real or a generated fake one. Frequently, some external information in the form of a source image $x$ is additionally used as an input to restrict both the generator in its output and the discriminator in its expected input. The objective function for such conditional GANs can be expressed through a two-player minimax game
$$G^\star = \arg\min_G \max_D \mathcal{L}_{\mathrm{cGAN}}(G, D) = \mathbb{E}_{x,\, y \sim p_{\mathrm{real}}(y)}\big[\log D(y \mid x)\big] + \mathbb{E}_{x,\, z \sim p_z(z)}\big[\log\big(1 - D(G(z \mid x) \mid x)\big)\big],$$
where $\mathbb{E}[\cdot]$ denotes the expectation value. To overcome the problem of instability when training a cGAN whose objective function is based on the negative log-likelihood, we use a least square loss instead, which yields the conditional least square generative adversarial network (cLSGAN) objective function
$$G^\star = \arg\min_G \max_D \mathcal{L}_{\mathrm{cLSGAN}}(G, D) = \mathbb{E}_{x,\, y \sim p_{\mathrm{real}}(y)}\big[(D(y \mid x) - 1)^2\big] + \mathbb{E}_{x,\, z \sim p_z(z)}\big[D(G(z \mid x) \mid x)^2\big].$$
In order to obtain DSMs in which the buildings feature sharply defined ridgelines and steep walls, we utilize the $L_1$ distance
$$\mathcal{L}_{L_1}(G) = \mathbb{E}_{x,\, y \sim p_{\mathrm{real}}(y),\, z \sim p_z(z)}\big[\lVert y - G(z \mid x) \rVert_1\big],$$
which prevents blurring effects.
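The least-squares adversarial term and the distance term can be illustrated with a small, dependency-free sketch. It assumes the common LSGAN practice of evaluating separate losses for the discriminator and the generator, and it assumes a mean absolute (L1) difference as the distance; the function names and the list-based inputs are hypothetical and are not taken from the authors' implementation.

```python
def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares loss for D: scores on real pairs go to 1, on generated pairs to 0."""
    real_term = sum((s - 1.0) ** 2 for s in d_real) / len(d_real)
    fake_term = sum(s ** 2 for s in d_fake) / len(d_fake)
    return real_term + fake_term


def lsgan_generator_loss(d_fake):
    """Least-squares loss for G: D's scores on generated samples should approach 1."""
    return sum((s - 1.0) ** 2 for s in d_fake) / len(d_fake)


def l1_distance(target, generated):
    """Mean absolute height difference between two equally sized height grids."""
    diffs = [
        abs(t - g)
        for row_t, row_g in zip(target, generated)
        for t, g in zip(row_t, row_g)
    ]
    return sum(diffs) / len(diffs)
```

Here `d_real` and `d_fake` stand for discriminator scores on real and generated height images, and the grids are nested lists of heights in meters.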
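As our major goal is to improve roof surfaces by making them flatter and closer to realistic ones, we integrate a normal vector loss term
$$\mathcal{L}_{\mathrm{normal}}(\mathcal{N}^t, \mathcal{N}^p) = \frac{1}{m} \sum_{i=1}^{m} \left(1 - \frac{\langle n_i^t, n_i^p \rangle}{\lVert n_i^t \rVert \, \lVert n_i^p \rVert}\right)$$
into the learning process, as proposed by [13]. This normal vector loss measures the angle between the set of surface normals $\mathcal{N}^p = \{n_1^p, \dots, n_m^p\}$ of an estimated DSM and the set of surface normals $\mathcal{N}^t = \{n_1^t, \dots, n_m^t\}$ of the target LoD 2-DSM. Adding those three terms together leads to our final objective
$$G^\star = \arg\min_G \max_D \mathcal{L}_{\mathrm{cLSGAN}}(G, D) + \lambda \, \mathcal{L}_{L_1}(G) + \gamma \, \mathcal{L}_{\mathrm{normal}}(\mathcal{N}^t, \mathcal{N}^p),$$
where $G$ intends to minimize the objective function $\mathcal{L}_{\mathrm{cLSGAN}}(G, D)$ against $D$, which aims to maximize it. The parameters $\lambda$ and $\gamma$ are the balancing hyper-parameters.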
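To make the normal vector term concrete, the following sketch computes per-pixel normals from a height grid and averages one minus the cosine of the angle between target and predicted normals. The forward-difference normal estimate, the `spacing` parameter, and all function names are assumptions made for illustration; the weights passed to `generator_objective` are placeholders, not the values used in the paper.

```python
import math


def surface_normals(height, spacing=1.0):
    """Normals (-dz/dx, -dz/dy, 1) from forward differences; last row/column is skipped."""
    rows, cols = len(height), len(height[0])
    normals = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            dzdx = (height[r][c + 1] - height[r][c]) / spacing
            dzdy = (height[r + 1][c] - height[r][c]) / spacing
            normals.append((-dzdx, -dzdy, 1.0))
    return normals


def normal_vector_loss(target, generated, spacing=1.0):
    """Mean of 1 - cos(angle) between corresponding target and generated normals."""
    n_t = surface_normals(target, spacing)
    n_p = surface_normals(generated, spacing)
    total = 0.0
    for a, b in zip(n_t, n_p):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        total += 1.0 - dot / norm
    return total / len(n_t)


def generator_objective(adversarial, l1, normal, lam, gamma):
    """Weighted sum of the adversarial, distance, and normal terms."""
    return adversarial + lam * l1 + gamma * normal
```

A flat roof compared with itself gives a loss of zero, while a roof tilted by 45 degrees against a flat target gives 1 - cos(45°), so the term penalizes orientation errors independently of the absolute height offset.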
2.2 Network architecture
We have already investigated the generation of LoD 2-like DSMs with enhanced building shapes from a single noisy photogrammetric DSM using a cGAN architecture with an objective function based on least square residuals [10]. In this work, we refer to it as a single-stream cGAN. As each photogrammetric DSM is a product obtained from panchromatic image pairs, it seems reasonable to integrate depth and spectral data into one single network, as the latter provide sharper information about building silhouettes, which allows not only a better reconstruction of missing building parts but also the refinement of building outlines. In [11], we fuse two separate but identical UNet-type networks at a late stage within the generator part of a cGAN, where the first stream is fed with the PAN image and the second stream with the stereo DSM, creating a so-called WNet architecture.
In this paper, we examine the potential of an earlier fusion of data from different modalities, as it could blend the depth and spectral information even better. The generator $G$ of the proposed Hybrid-cGAN network consists of two encoders, $E_1$ and $E_2$, concatenated at the top layer, and a common decoder, which integrates information from the two modalities and generates an LoD 2-like DSM with refined building shapes. The inputs to $E_1$ are the single-channel orthorectified PAN images, while $E_2$ receives the single-channel photogrammetric DSMs with continuous values. Since intensity and depth information have different physical meanings, it is unlikely to make sense to propagate them jointly right from the beginning. It seems reasonable to separate them first and allow the network to learn the most valuable features from each modality on its own. The generator is constructed based on the U-shaped network proposed by [14]. As a result, in our case it has 14 skip connections from both the spectral stream and the depth stream, allowing the decoder to recover detailed features that were lost by pooling in the encoders. In particular, the encoders and the decoder consist of 8 convolutional layers each, followed by a leaky rectified linear unit (LReLU) [15] activation function
$$\sigma_{\mathrm{LReLU}}(z) = \max(0, z) + a \min(0, z), \quad 0 < a < 1,$$
and batch normalization (BN) in the case of the encoders, and a rectified linear unit (ReLU) activation function
$$\sigma_{\mathrm{ReLU}}(z) = \max(0, z)$$
and BN in the case of the decoder. On top of the generator network $G$, the $\tanh$ activation function is applied.
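The activation functions and the skip-connection count can be checked with a few lines of plain Python. This is only a sketch: the negative slope of 0.2, the tanh output squashing, and the reading that every encoder layer except the innermost one contributes a skip connection are assumptions typical for U-shaped generators, chosen because they reproduce the 14 connections stated for two 8-layer encoder streams.

```python
import math


def leaky_relu(z, negative_slope=0.2):
    """Pass positive inputs unchanged and scale negative inputs by a small slope."""
    return z if z > 0 else negative_slope * z


def relu(z):
    """Clamp negative inputs to zero."""
    return max(0.0, z)


def tanh_output(z):
    """Squash the final generator response into the range (-1, 1)."""
    return math.tanh(z)


def count_skip_connections(num_encoder_layers, num_streams):
    """Assumed rule: every encoder layer except the innermost one feeds the decoder."""
    return num_streams * (num_encoder_layers - 1)


# Two encoder streams (spectral and depth) with 8 layers each.
skips = count_skip_connections(num_encoder_layers=8, num_streams=2)
```

Under these assumptions, `skips` equals 14, which matches the number given in the text.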
The discriminator network $D$ is a binary classification network constructed with 5 convolutional layers, followed by an LReLU activation function and a BN layer. It has a sigmoid activation function at the top layer to output the likelihood that the input image belongs either to class 1 ("real") or class 0 ("generated"). A schematic representation of the proposed architecture is depicted in Fig. 2.
Throughout the training phase, the two networks $G$ and $D$ are trained at the same time by alternating one gradient descent step of $D$ and one gradient descent step of $G$. During inference, only the trained generator model of the Hybrid-cGAN network is involved.
3 Study Area and Experiments
3.1 Dataset
Experiments have been carried out on data showing the city of Berlin, Germany, covering a total area of . It consists of half-meter resolution orthorectified PAN images showing the closest nadir view and photogrammetric DSMs generated with semi-global matching (SGM) [16] from six panchromatic Worldview-1 images acquired on two different days. As ground truth, the artificial LoD 2-DSM, generated with a resolution of from a CityGML data model together with an available digital terrain model (DTM), was used. The CityGML data model is freely available from the download portal Berlin 3D (http://www.businesslocationcenter.de/downloadportal). We followed the same LoD 2-DSM creation procedure as in [10].
3.2 Implementation and Training Details
The proposed Hybrid-cGAN network was implemented using the PyTorch Python package, based on the implementation published by [14]. The prepared training dataset covers and consists of pairs of patches of size obtained by tiling the given satellite image on the fly with random overlap in both horizontal and vertical directions. This procedure gives the network the chance to learn building shapes that, during one epoch, may be located on the patch border and are therefore only partially presented to the network. For the validation phase, an area covering was used for tuning the hyper-parameters. The Hybrid-cGAN network and the other networks used in this paper for comparison were trained with minibatch stochastic gradient descent (SGD) using the Adam optimizer [17] with an initial learning rate of and momentum parameters and . We set the weighting hyper-parameters and after performing training and examining the resulting generated images from the validation dataset. The total number of epochs was set to 200 with a batch size of 5 on a single NVIDIA TITAN X (PASCAL) GPU with 12 GB of memory.
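On-the-fly tiling with random overlap can be sketched as drawing random window positions and cropping the same window from the PAN image, the photogrammetric DSM, and the LoD 2-DSM target. The function names, the dictionary layout, and the nested-list image representation are hypothetical, and the patch size is a free parameter here because it is not preserved in the text above.

```python
import random


def crop(grid, row0, col0, patch):
    """Cut a patch x patch window whose top-left corner is (row0, col0)."""
    return [row[col0:col0 + patch] for row in grid[row0:row0 + patch]]


def sample_training_patches(pan, dsm, target, patch, count, seed=None):
    """Draw `count` random, possibly overlapping, co-registered training windows."""
    rng = random.Random(seed)
    rows, cols = len(pan), len(pan[0])
    samples = []
    for _ in range(count):
        # Random origins mean a building cut by one patch border can
        # appear fully inside another patch later on.
        row0 = rng.randint(0, rows - patch)
        col0 = rng.randint(0, cols - patch)
        samples.append({
            "origin": (row0, col0),
            "pan": crop(pan, row0, col0, patch),
            "dsm": crop(dsm, row0, col0, patch),
            "target": crop(target, row0, col0, patch),
        })
    return samples
```

Because all three layers are cropped with the same origin, each sample stays pixel-aligned across modalities, which is what a two-encoder generator requires.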
In contrast to the training phase, where the two networks $G$ and $D$ were trained at the same time by alternating one gradient descent step of $D$ and one gradient descent step of $G$, in the inference phase only the trained generator model was involved. For a test area entirely unseen during the training and validation phases, we stitched the predicted LoD 2-like height images with a fixed overlap of half the patch size in both horizontal and vertical directions and thereby generated the final full image covering an area of .
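Stitching with a fixed overlap of half the patch size can be sketched as follows. The tile origins advance with a stride of half a patch, and overlapping predictions are averaged; the averaging rule, the clamping of the last tile to the image border, and the `predict` callback are assumptions for illustration, since the text does not state how overlapping pixels are merged.

```python
def tile_origins(length, patch):
    """Start offsets with a stride of half the patch; the last tile is clamped to the border."""
    stride = patch // 2
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)
    return origins


def stitch(height, width, patch, predict):
    """Average overlapping patch predictions into one full-size height image."""
    acc = [[0.0] * width for _ in range(height)]
    hits = [[0] * width for _ in range(height)]
    for r0 in tile_origins(height, patch):
        for c0 in tile_origins(width, patch):
            tile = predict(r0, c0)  # patch x patch grid of predicted heights
            for dr in range(patch):
                for dc in range(patch):
                    acc[r0 + dr][c0 + dc] += tile[dr][dc]
                    hits[r0 + dr][c0 + dc] += 1
    return [[acc[r][c] / hits[r][c] for c in range(width)] for r in range(height)]
```

In practice `predict` would run the trained generator on the PAN and DSM patches at the given origin; the sketch assumes the image is at least one patch wide and high.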
4 Results and Discussion
Selected test samples of the DSMs generated by the single-stream cGAN model [10], the two-stream WNet-cGAN model [11], and our proposed Hybrid-cGAN model are illustrated in Figs. 3, 5 and 6. Small buildings in all generated DSMs show more rectilinear borders and are not merged with adjacent trees, as is the case in the input DSMs (cf. Figs. 3(c), 5(b) and 6(b)). The integration of spectral information into the model clearly benefits the building reconstruction process. First, the number of reconstructed buildings is increased. For instance, the magenta arrows in Fig. 3(b) highlight areas in the DSM generated by the single-stream cGAN model where the model was not able to reconstruct individual buildings, as opposed to the WNet-cGAN network and our proposed Hybrid-cGAN network. Second, the roof ridge lines are more distinguishable and rectilinear. This can also be confirmed by examining the profiles of the two buildings highlighted in Fig. 3(f). Fig. 4 shows that the Hybrid-cGAN network was able to reconstruct much finer building shapes that are more similar to the ground truth. Moreover, the surfaces of roof planes are smoother or even flat in many cases, affirming the influence of the normal vector loss. The profiles also demonstrate the ability of all networks to separate the buildings from adjacent vegetation. From these results we can further conclude that most of the generated building shapes follow the correct pattern and feature improved roof forms.
However, this holds mainly for residential buildings and for industrial buildings that are not too large. How does the network behave in the case of large and complicated building structures? The single depth-stream cGAN model [10] and the two-stream WNet-cGAN model [11] only partially extract such buildings. Integrating spectral information, even with late fusion, helps to improve the silhouettes of the buildings as well as the detailed constructions on the rooftops (cf. Figs. 6(e) and 5(e)), but the interior parts of the structures are still missing or incomplete. Moreover, as the input photogrammetric DSMs contain noise and many outliers, these propagate through the height image reconstruction and influence the results, as indicated by the dark blue areas in Figs. 5(d) and 5(e), and Figs. 6(d) and 6(e). The proposed Hybrid-cGAN network, on the other hand, was able not only to eliminate those artifacts but also to reconstruct the complete building structures in fine detail. Besides, although the building rooftops appear entirely flat in the ground truth data (cf. Figs. 5(c) and 6(c)), which is not the case in reality, such cases do not confuse the model during the training phase, and it remains capable of preserving detailed 3D information from the input photogrammetric DSM. These observations indicate that the introduced Hybrid-cGAN architecture can successfully blend the spectral and height information. The earlier combination of both modalities forces the network to accommodate the information even better.
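In order to evaluate the generated elevation models quantitatively, we utilized error metrics commonly used in the relevant literature [18, 19, 20, 21], namely the root mean squared error (RMSE), the normalized median absolute deviation (NMAD), and the mean absolute error (MAE),
$$\varepsilon_{\mathrm{RMSE}}(h, \hat{h}) = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \big(\hat{h}_j - h_j\big)^2},$$
$$\varepsilon_{\mathrm{NMAD}}(h, \hat{h}) = 1.4826 \cdot \operatorname{median}_j \big(\lvert \Delta h_j - m_{\Delta h} \rvert\big),$$
$$\varepsilon_{\mathrm{MAE}}(h, \hat{h}) = \frac{1}{n} \sum_{j=1}^{n} \lvert \hat{h}_j - h_j \rvert,$$
where $h = (h_j)_j$ and $\hat{h} = (\hat{h}_j)_j$, $1 \le j \le n$, denote the actually observed and the predicted heights, respectively, height errors are defined as $\Delta h_j = \hat{h}_j - h_j$, and the median error is $m_{\Delta h}$. The constant 1.4826 included in the NMAD metric makes it comparable to the standard deviation if the data errors are normally distributed. This estimator can be considered more robust to outliers in the dataset [18].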
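The three error metrics can be written directly with the Python standard library. This is a minimal sketch for flat lists of observed and predicted heights; the function names are hypothetical, and the sign convention of the height error (predicted minus observed) is an assumption that does not affect any of the three values.

```python
import math
import statistics


def height_errors(observed, predicted):
    """Per-pixel height errors, predicted minus observed."""
    return [p - o for o, p in zip(observed, predicted)]


def rmse(observed, predicted):
    """Root mean squared error of the height residuals."""
    errs = height_errors(observed, predicted)
    return math.sqrt(sum(e * e for e in errs) / len(errs))


def mae(observed, predicted):
    """Mean absolute error of the height residuals."""
    errs = height_errors(observed, predicted)
    return sum(abs(e) for e in errs) / len(errs)


def nmad(observed, predicted):
    """Normalized median absolute deviation: 1.4826 * median(|error - median error|)."""
    errs = height_errors(observed, predicted)
    m = statistics.median(errs)
    return 1.4826 * statistics.median([abs(e - m) for e in errs])
```

For observed heights `[10, 12, 15, 20]` and predictions `[11, 12, 13, 24]`, the residuals are 1, 0, -2, and 4, so the MAE is 1.75, the RMSE is the square root of 5.25, and the NMAD is 1.4826 times 1.5.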
To exclude the influence of the difference in acquisition time between the photogrammetric DSMs and the CityGML data model, which carries the risk that buildings are absent or newly built, we manually checked the evaluation regions in this regard and carried out the evaluation on regions showing buildings in both the photogrammetric DSM and the reference LoD 2-DSM.
The obtained results for the three test area samples are presented in Tables 1, 2 and 3. In comparison to the other methods, the DSMs created by the single-stream model [10] showed inferior results in terms of RMSE for all three areas. As already pointed out for Fig. 3(b), the model was not able to reconstruct some buildings at all or reconstructed them only partially, as highlighted by way of example in Figs. 5(d) and 6(d).
With the intensity information integrated into the learning process, the RMSE decreased and, in the case of our proposed Hybrid-cGAN, reached the lowest value, even smaller than for the input photogrammetric DSM. This observation provides evidence that the proposed model improves the noisy and inaccurate photogrammetric DSMs to a high level of detail. Considering the other two metrics, Hybrid-cGAN also outperformed the competing models, except for the third test area. The single-stream DSM [10] achieved the lowest NMAD error with . This seems reasonable, as the buildings show flat roofs in the ground truth, which is not the case for either the input photogrammetric DSM or the DSMs generated by the models with spectral information integrated. As the NMAD metric is sensitive to outliers, it shows higher values for the results with more detailed roofs.
Our results demonstrate that deep learning models produce visually reasonable reconstructed DSMs. However, the RMSE, NMAD, and MAE metrics used here do not give sufficient insight into the depth estimation quality, as they mainly consider the overall accuracy by reporting global statistics of depth residuals. Moreover, the limited quality of the available ground truth data also influences the evaluation procedure.
5 Conclusion
We presented a methodology for automatic building shape refinement from low-quality digital surface models (DSMs) to level of detail (LoD) 2 from multiple spaceborne remote sensing data on the basis of conditional generative adversarial networks (cGANs). The network automatically combines the advantages of pan-chromatic (PAN) imagery and photogrammetric DSMs while compensating for their individual drawbacks and, according to the obtained results, shows the potential to generate DSMs with complete building structures for both residential and industrial buildings. Moreover, the generated roof surfaces are smoother and more planar, giving evidence of the positive influence of the auxiliary normal vector loss function. Besides, a 3D visualization of the generated elevation models illustrates the realistic appearance of the buildings and their strong resemblance to the ground truth.
References
- [1] Jeffrey P Walker and Garry R Willgoose “A comparative study of Australian cartometric and photogrammetric digital elevation model accuracy” In Photogrammetric Engineering & Remote Sensing 72.7, American Society for Photogrammetry and Remote Sensing, 2006, pp. 771–779
- [2] Ping Wang “Applying two dimensional Kalman filtering for digital terrain modelling” In Proceedings of International Archives of Photogrammetry, Remote Sensing, and Spatial Information Sciences, 1998, pp. 649–656
- [3] Angel M Felicísimo “Parametric statistical method for error detection in digital elevation models” In ISPRS Journal of Photogrammetry and Remote Sensing 49.4, Elsevier, 1994, pp. 29–33
- [4] ES Anderson, JA Thompson and RE Austin “LIDAR density and linear interpolator effects on elevation estimates” In International Journal of Remote Sensing 26.18, Taylor & Francis, 2005, pp. 3889–3900
- [5] Thomas Krauß and Peter Reinartz “Enhancement of dense urban digital surface models from VHR optical satellite stereo data by pre-segmentation and object detection” In International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences 38.1, 2010, pp. 6
- [6] Beril Sirmacek, Pablo d’Angelo, Thomas Krauss and Peter Reinartz “Enhancing urban digital elevation models using automated computer vision techniques” In International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 2010
- [7] David Eigen, Christian Puhrsch and Rob Fergus “Depth map prediction from a single image using a multi-scale deep network” In Advances in Neural Information Processing Systems, 2014, pp. 2366–2374
- [8] Fayao Liu, Chunhua Shen, Guosheng Lin and Ian Reid “Learning depth from single monocular images using deep convolutional neural fields” In IEEE Transactions on Pattern Analysis and Machine Intelligence 38.10, IEEE, 2016, pp. 2024–2039
