TL;DR
This paper introduces a novel MRF-based depth upsampling method guided by image and 3D surface normal features, leveraging camera models to improve urban environment mapping accuracy.
Contribution
The paper proposes a new regularization term based on surface planarity, enhancing depth upsampling in urban scenes with predominantly planar surfaces.
Findings
Outperforms recent distance-based regularization methods on synthetic data
Improves depth upsampling quality in urban environments
Validated on real mapping applications with an experimental vehicle
Abstract
We present an improved model for MRF-based depth upsampling, guided by image- as well as 3D surface normal features. By exploiting the underlying camera model we define a novel regularization term that implicitly evaluates the planarity of arbitrarily oriented surfaces. Our method improves upsampling quality in scenes composed of predominantly planar surfaces, such as urban areas. We use a synthetic dataset to demonstrate that our approach outperforms recent methods that implement distance-based regularization terms. Finally, we validate our approach for mapping applications on our experimental vehicle.
Guided Depth Upsampling for Precise Mapping of Urban Environments
Sascha Wirges1, Björn Roxin2, Eike Rehder2, Tilman Kühner1 and Martin Lauer2
1 Sascha Wirges and Tilman Kühner are with the FZI Research Center for Information Technology and Karlsruhe Institute of Technology, Germany. {wirges,kuehner}@fzi.de
2 Björn Roxin, Eike Rehder and Martin Lauer are with the Institute of Measurement and Control, Karlsruhe Institute of Technology, Germany. [email protected], {rehder,lauer}@kit.edu
I Introduction
Perception and localization algorithms developed for automated driving tasks rely on accurate environment models. These models are usually generated using information provided by mobile sensors such as cameras or range sensors. Whereas cameras provide 2D projections of surface reflectances with high spatial resolution, range sensors usually provide precise 3D surface positions. However, the spatial resolution of modern range sensors is low compared to that of cameras.
Currently, most systems perform environmental mapping within one sensor domain, which has several drawbacks. Common methods usually perform feature estimation and matching to find corresponding surface landmarks between subsequent measurement frames. For camera-based mapping methods, the scale might either be subject to drift or be hard to estimate accurately in the calibration process, which results in globally inconsistent maps. For range sensor-based methods, the resulting map may consist of accurate but spatially sparse 3D points, which inherently induces errors in surface feature estimation and reconstruction. Thus, our aim is to combine the strengths of both sensor types to generate a map that consists of spatially dense surface features.
Here, we propose a guided depth upsampling method that estimates surfaces accurately for each camera pixel within scenes composed of predominantly planar surfaces, such as urban areas. Provided with a calibrated camera-laser setup, the 3D surface point position can be determined by evaluating the viewing ray corresponding to an image coordinate at an estimated depth.
However, as different image areas usually have varying 3D point densities, the quality of depth upsampling might vary drastically. Therefore, we are also interested in finding a confidence measure for each depth estimate. We show that our method is capable of performing accurate upsampling within image areas that contain only few 3D point observations. Finally, we provide a filtering method that stems from our optimization model to filter out ill-conditioned depths.
By describing similarities and differences of related work in guided depth upsampling in section II, we show common drawbacks and emphasize our ideas to overcome these problems. Based on these findings, we formalize our objectives in depth upsampling and derive the underlying Markov Random Field model in section III. We then validate our approach on a photorealistic indoor dataset and on our experimental vehicle (section IV). Finally, we summarize our findings in section V and outline our next steps in guided depth upsampling.
II Motivation and Related Work
Our general objective in depth upsampling is to estimate a depth $d_i$ for each image coordinate $u_i$ of the image $\mathcal{I}$. Depth observations $z_i$ from range sensors might be available only for a small subset $\mathcal{O} \subset \mathcal{I}$ of image coordinates.
Assuming a calibrated camera-laser setup, each depth $d_i$ can be transformed into a corresponding 3D point
$$\mathbf{p}_i = \mathbf{o}_i + d_i \, \mathbf{v}_i \tag{1}$$
using the viewing ray function that includes for each image coordinate $u_i$ the direction $\mathbf{v}_i$ and the viewpoint $\mathbf{o}_i$ of the respective line of sight. In the following, we will use an image coordinate $u_i$ and its index $i$ interchangeably.
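The relation between a depth and its 3D point can be sketched in a few lines. The sketch below assumes an ideal pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`; the method itself only requires some viewing ray lookup, so the pinhole model and all function names here are illustrative assumptions.

```python
import math

def pinhole_viewing_ray(u, v, fx, fy, cx, cy):
    """Return (viewpoint, unit direction) of the line of sight through pixel (u, v).

    Assumption: ideal pinhole camera located at the origin of the camera frame.
    """
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (0.0, 0.0, 0.0), (x / norm, y / norm, 1.0 / norm)

def point_from_depth(viewpoint, direction, depth):
    """Evaluate the viewing ray at the given depth (distance along the ray)."""
    return tuple(o + depth * d for o, d in zip(viewpoint, direction))

# Hypothetical intrinsics, for illustration only.
origin, direction = pinhole_viewing_ray(320.0, 240.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
center_point = point_from_depth(origin, direction, depth=10.0)

_, off_axis_direction = pinhole_viewing_ray(420.0, 240.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
off_axis_point = point_from_depth(origin, off_axis_direction, depth=10.0)
```

With a unit-length direction, the depth is the Euclidean distance from the viewpoint to the surface point, so the off-axis point above lies 10 units from the camera but has a z coordinate below 10.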
Upsampling methods may be divided into local or global methods. Whereas local methods [1], [2] can be used for upsampling images with mostly dense and uniformly distributed depth observations, global methods show better performance on data with sparse and non-uniformly distributed observations. In urban mapping, however, the number of observations might vary drastically, depending on the scene setting. Therefore, we focus on global upsampling methods.
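Within global methods, the optimal depths, arranged in the vector
$$\hat{\mathbf{d}} = \arg\min_{\mathbf{d}} E(\mathbf{d}) \tag{2}$$
minimize the cost function $E$ that may be composed of various cost terms. Diebel and Thrun [3] model this problem as a Markov Random Field with a cost term
$$E(\mathbf{d}) = \sum_{i \in \mathcal{O}} E_{\mathrm{data}}(d_i) + \sum_{i \in \mathcal{I}} E_{\mathrm{reg}}(d_i) \tag{3}$$
that does not only minimize costs towards the given observation data at the observed image coordinates $\mathcal{O}$, but also within the direct neighborhood $\mathcal{N}_i$ of each image coordinate $i$ of the image $\mathcal{I}$, where the regularization term
$$E_{\mathrm{reg}}(d_i) = \sum_{j \in \mathcal{N}_i} w_{ij} \, (d_i - d_j)^2 \tag{4}$$
with weights $w_{ij}$ is used to enforce that the estimated depths $d_i$ and $d_j$ of direct image coordinate neighbors (e.g. within a 4-connected grid) are similar. However, this model assumption does not hold for arbitrarily oriented planes, as it regularizes towards constant depth.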
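To make the structure of such a cost function concrete, the following sketch evaluates a data term plus a weighted depth-difference regularizer on a tiny depth image. It is a minimal illustration in the spirit of [3]; the function names, the exponential intensity weight and its constant `c` are assumptions chosen for the example. Each 4-connected pair is counted once here, which differs from a double sum over all neighborhoods only by a constant factor.

```python
import math

def mrf_cost(depths, observations, weight_fn, data_weight=1.0):
    """Evaluate data cost plus depth-difference regularization on a small depth image.

    depths:       2D list of estimated depths.
    observations: dict {(row, col): observed depth} for the sparse measurements.
    weight_fn:    callable (pixel_a, pixel_b) -> non-negative regularization weight.
    """
    rows, cols = len(depths), len(depths[0])
    data = sum(data_weight * (depths[r][c] - z) ** 2 for (r, c), z in observations.items())
    reg = 0.0
    for r in range(rows):
        for c in range(cols):
            # Right and lower neighbor only, so each 4-connected pair is counted once.
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < rows and nc < cols:
                    reg += weight_fn((r, c), (nr, nc)) * (depths[r][c] - depths[nr][nc]) ** 2
    return data + reg

# Toy gray-scale image with an intensity edge between columns 1 and 2.
intensity = [[0.2, 0.2, 0.9],
             [0.2, 0.2, 0.9]]

def exp_weight(a, b, c=5.0):
    """Weight that decays with the squared intensity difference of two pixels."""
    diff = intensity[a[0]][a[1]] - intensity[b[0]][b[1]]
    return math.exp(-c * diff * diff)

observations = {(0, 0): 2.0, (1, 2): 6.0}
flat = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]   # ignores the second observation
step = [[2.0, 2.0, 6.0], [2.0, 2.0, 6.0]]   # depth jump aligned with the intensity edge

cost_flat = mrf_cost(flat, observations, exp_weight)
cost_step = mrf_cost(step, observations, exp_weight)
```

Because the weight is small across the intensity edge, the depth image with a jump at that edge is much cheaper than the flat one, which is the guiding effect of image features described in the text.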
The weights in equation (4) might be used to include additional information on the problem. Whereas invariant weights are used in image filtering applications [4], weights depending on image features can guide upsampling and thus improve quality. In all guided approaches, image features are used to indicate depth discontinuities (see table I). In particular, [5] shows that image values and range measurements share second order statistics. Based on this work, either gray scale [3] or color intensity gradients are used. [6] includes semantic information in the regularization term and determines extended neighborhoods based on geodesic distances. Even higher-order terms such as the anisotropic diffusion tensor [7] or [8] might be used. The authors of [8] add a non-local means regularization term, which uses an anisotropic structure-aware filter to allow similar pixels in extended neighborhoods to reinforce each other.
Although guided approaches based on image features have been studied extensively, a major drawback of existing methods is that they do not incorporate 3D features into the upsampling process. Therefore, we show the benefit of including 3D surface normals in our problem.
Even if recent methods achieve accurate results, they do not account for confidences in the estimation problem. We provide a simple method based on estimating the parameter covariance of the underlying optimization problem at the end of the next section.
III Guided Depth Upsampling
For each image coordinate, we aim to determine its depth and a depth confidence measure. To achieve this, we require a calibrated camera-laser rig that provides viewing ray lookup functions as described in equation (1), and the transform between the range sensor frame and the camera frame must be known. Given this transform, observed 3D point features can be transformed into the camera frame and mapped to their corresponding image coordinates.
As in equation (3), we model our upsampling problem as a Markov Random Field containing data costs and regularization costs. We can include additional image features in the optimization problem, which we explain in section III-C. These cost terms are minimized starting from depth priors determined by our initialization strategy explained in section III-D. In the following, we describe the different energy functions included in our model.
III-A Data Costs
For each observation, we set up data costs
[Equation (5)]
in which the residual terms are scaled by weighting factors. Here, the direct neighborhood of an image coordinate is chosen to be a 4-connected grid.
Since we want to include depth observations from range sensors, depth residuals
$$r_{\mathrm{depth}}(d_i) = d_i - z_i \tag{6}$$
evaluate the difference between the estimated depth $d_i$ and the observed depth $z_i$ for each image coordinate $i$ with an observation.
In addition, we include estimated surface normals from range sensors as pseudo measurements in our problem. To this end, normal residuals
[Equation (7)]
evaluate the signed point-to-plane distance between the constructed plane and a 3D surface point. The plane is constructed from the surface normal and can be expressed in normal form
[Equation (8)]
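The two kinds of data residuals described above can be sketched as follows. This is a minimal illustration under assumptions: the plane is taken to pass through an observed 3D point with its estimated normal, and which estimated points are tested against which plane is not specified here, so the pairing in the example is hypothetical.

```python
import math

def depth_residual(estimated_depth, observed_depth):
    """Difference between estimated and observed depth."""
    return estimated_depth - observed_depth

def point_to_plane_residual(point, plane_point, normal):
    """Signed distance of `point` to the plane through `plane_point` with normal `normal`.

    The normal does not need to be unit length; it is normalized here.
    """
    length = math.sqrt(sum(n * n for n in normal))
    return sum(n * (p - q) for n, p, q in zip(normal, point, plane_point)) / length

# Hypothetical observation: a surface point with a normal pointing along +z.
observed_point = (0.0, 0.0, 5.0)
observed_normal = (0.0, 0.0, 2.0)

in_front = point_to_plane_residual((1.0, 2.0, 5.5), observed_point, observed_normal)
behind = point_to_plane_residual((1.0, 2.0, 4.0), observed_point, observed_normal)
on_plane = point_to_plane_residual((3.0, -1.0, 5.0), observed_point, observed_normal)
```

The sign of the residual tells on which side of the plane the point lies, and points anywhere on the plane yield a zero residual regardless of their lateral position.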
III-B Regularization Costs
To model coupling in the Markov sense, we add regularization cost terms for each image coordinate within its direct neighborhood. An 8-connected grid provides stronger coupling, but its large number of residual blocks decreases optimization speed, whereas choosing only two neighbors leads to poor coupling and slows convergence. Thus, we choose 4-connected grids, as they provide a good trade-off between coupling and the number of coupling residuals in the problem.
We aim to minimize the regularization cost
[Equation (9)]
for each image coordinate in the image. Here, each residual is scaled by a weighting term that depends on the image coordinate and a subset of its neighborhood. We will explain in section III-C how this weighting term is composed. In the simplest case, the residual terms are evaluated between pairs of image coordinates as presented in equation (4). Here, we extend the residual computation to depend on sets of multiple image coordinates.
Instead of regularizing towards constant depth (e.g. as in [3]), we enforce the surface points to be coplanar. Thus, we aim to find an appropriate residual term that shows good convergence properties.
One option would be to estimate surface normals explicitly based on all points in the corresponding neighborhood and find an appropriate point-to-plane residual, similar to equation (7).
We are, however, not interested in computing normals directly, but instead in finding a residual term that evaluates the planarity of surface points. Here, we assume neighboring viewing rays along one row or column to be coplanar. Although this assumption might not hold for arbitrary camera models, it can be justified for a sufficiently small neighborhood around a reference image coordinate. As the intersection between the plane spanned by these viewing rays and an ideal surface plane forms a line, we can add the residual
[Equation (10)]
that evaluates whether triples of points are collinear. As depicted in Figure 2, the vectors
[Equation (11)]
are the pairwise differences between the three points, which lie in the common plane of their viewing rays.
Using the collinearity residual in equation (10), we can add one residual term for direct neighbors within the same image row and one term for neighbors within the same column. As only three parameters are coupled within each residual, the problem sparsity is increased, which leads to better convergence properties compared to explicitly estimating surface normals.
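The difference between regularizing towards constant depth and regularizing towards collinear points can be illustrated on three neighboring pixels of one image row. The collinearity measure below (length of the cross product of the two difference vectors) is one possible choice made for this sketch and is not necessarily the residual of equation (10); the camera, the ray directions and the wall are hypothetical.

```python
import math

def point_on_ray(viewpoint, direction, depth):
    """3D point on the viewing ray at the given depth."""
    return tuple(o + depth * d for o, d in zip(viewpoint, direction))

def collinearity_residual(p_prev, p_mid, p_next):
    """Length of the cross product of the two difference vectors.

    Zero if and only if the three points are collinear (or coincide).
    Illustrative choice, not necessarily the residual of equation (10).
    """
    a = tuple(m - p for m, p in zip(p_mid, p_prev))
    b = tuple(n - m for n, m in zip(p_next, p_mid))
    cx = a[1] * b[2] - a[2] * b[1]
    cy = a[2] * b[0] - a[0] * b[2]
    cz = a[0] * b[1] - a[1] * b[0]
    return math.sqrt(cx * cx + cy * cy + cz * cz)

# Three neighboring pixels of one image row. Directions are scaled so that
# depth equals the z coordinate (hypothetical pinhole camera at the origin).
viewpoint = (0.0, 0.0, 0.0)
directions = [(-0.1, 0.0, 1.0), (0.0, 0.0, 1.0), (0.1, 0.0, 1.0)]

# Depths of a slanted wall z = 5 + 2x: solving d = 5 + 2 * (d * x) gives d = 5 / (1 - 2x).
wall_depths = [5.0 / (1.0 - 2.0 * d[0]) for d in directions]
wall_points = [point_on_ray(viewpoint, d, z) for d, z in zip(directions, wall_depths)]

planar_residual = collinearity_residual(*wall_points)
depth_differences = [wall_depths[1] - wall_depths[0], wall_depths[2] - wall_depths[1]]

# Moving the middle point off the wall breaks collinearity.
bent_points = [wall_points[0], point_on_ray(viewpoint, directions[1], 6.0), wall_points[2]]
bent_residual = collinearity_residual(*bent_points)
```

For the slanted wall, the collinearity measure vanishes while the depth differences between neighbors are clearly nonzero, so a constant-depth regularizer would penalize the correct solution. Only the three depths of the triple enter each residual, which matches the sparsity argument made above.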
III-C Regularization Weights
The collinearity residual in equation (10) should only be applied to areas satisfying the assumption of planar surfaces. To accomplish this, we use additional image features expressed as regularization weights.
Here, we employ weights that are defined between pairs of neighboring image coordinates. For the collinearity residual, each residual involves three image coordinates, and we determine pairwise weights
[Equation (12)]
as components of the regularization weights
[Equation (13)]
added for each image coordinate.
Pairwise weights are composed of a scalar weighting function and the image features of the two image coordinates involved. The weighting function might be exponential, sigmoid, step or even constant; in the latter case, local image features have no influence on the regularization cost. It is important to note that arbitrary features might be used as long as they provide information about scene planarity.
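The four families of weighting functions mentioned above can be sketched as functions of a feature distance. The parameter names and default values (`scale`, `threshold`, `steepness`) are hypothetical, and the features are assumed to be normalized to the range [0, 1] so that distances stay small.

```python
import math

def feature_distance(f_a, f_b):
    """Euclidean distance between two feature vectors (e.g. normalized RGB values)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_a, f_b)))

def exponential_weight(dist, scale=10.0):
    """Decays smoothly from 1 towards 0 with growing feature distance."""
    return math.exp(-scale * dist)

def sigmoid_weight(dist, threshold=0.1, steepness=50.0):
    """Soft step: close to 1 below the threshold, close to 0 above it.

    Assumes small distances (normalized features); very large distances
    would overflow math.exp.
    """
    return 1.0 / (1.0 + math.exp(steepness * (dist - threshold)))

def step_weight(dist, threshold=0.1):
    """Hard step: regularize only between sufficiently similar pixels."""
    return 1.0 if dist < threshold else 0.0

def constant_weight(dist):
    """Ignores image features entirely."""
    return 1.0

similar = feature_distance((0.20, 0.20, 0.20), (0.22, 0.20, 0.21))
different = feature_distance((0.20, 0.20, 0.20), (0.90, 0.85, 0.80))
weights = {
    fn.__name__: (fn(similar), fn(different))
    for fn in (exponential_weight, sigmoid_weight, step_weight, constant_weight)
}
```

All feature-dependent variants assign a high weight to the similar pair and a low weight to the dissimilar pair, so planarity is enforced within homogeneous image regions and relaxed across feature edges.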
III-D Prior Estimation
In our contribution, we do not focus on solving the optimization problem efficiently by analyzing the underlying problem structure; please refer to [4] or [6] for implementation hints. Instead, we suggest an initialization method based on linear interpolation that significantly reduces both the optimization time and the number of iterations.
For our initialization method, projected 3D points need to be found for every query image coordinate. Therefore, we generate a kd-tree as described in [9] that includes the set of point projections within the image coordinate frame. This kd-tree search structure quickly provides references to the nearest laser point projections for any query image coordinate.
As depicted in Figure 3, we generate a triangle mesh of the point cloud and project it into the image. For each image coordinate within a triangle, we intersect the corresponding viewing ray with the plane constructed from the three points defining the triangle and use the resulting depth as initialization. Image coordinates that are not covered by any triangle are assigned the depth of their nearest neighbor. This might be the case at image borders or for 3D points not connected by the meshing algorithm.
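The interpolation step inside a triangle amounts to a ray-plane intersection. The sketch below shows only this step; the mesh generation, the test whether a pixel lies inside a projected triangle, and the nearest-neighbor fallback are omitted, and all names are illustrative.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def initial_depth(viewpoint, direction, triangle):
    """Depth at which the viewing ray hits the plane through the three triangle vertices.

    Returns None if the ray is (numerically) parallel to the plane.
    """
    a, b, c = triangle
    normal = cross(sub(b, a), sub(c, a))
    denom = dot(normal, direction)
    if abs(denom) < 1e-12:
        return None
    return dot(normal, sub(a, viewpoint)) / denom

# Hypothetical triangle of three laser points on the plane z = 4 + x.
triangle = ((0.0, 0.0, 4.0), (1.0, 0.0, 5.0), (0.0, 1.0, 4.0))
viewpoint = (0.0, 0.0, 0.0)

depth_center = initial_depth(viewpoint, (0.0, 0.0, 1.0), triangle)
depth_oblique = initial_depth(viewpoint, (0.6, 0.0, 0.8), triangle)
depth_parallel = initial_depth(viewpoint, (1.0, 0.0, 1.0), triangle)
```

The oblique ray hits the plane at (12, 0, 16), i.e. at depth 20 along its unit direction, while a ray parallel to the plane has no intersection and would need the nearest-neighbor fallback.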
III-E Covariance Estimation
In some scenarios, depth estimation may not work well. On the one hand, the range sensor's field of view might not cover the camera's field of view. On the other hand, by evaluating equation (13), image areas might be decoupled from their neighborhood while no depth observations exist in this area. This might be the case when the image features of a pixel neighborhood indicate non-planar surfaces in a closed area and thus scale down regularization costs.
To resolve these problems, we aim to assign a confidence measure to each estimated depth after optimization by evaluating the covariance
[Equation (14)]
of the estimated depths, where the variance of each depth can then be obtained from the corresponding diagonal element of the covariance matrix.
Knowing an estimate of these variances, we can then set a threshold and keep only those depths with variances below that threshold.
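The confidence filtering can be sketched on a toy problem with four depths along one scanline. The sketch assumes the standard least-squares covariance approximation, the inverse of J^T J with J the residual Jacobian, which is also what Ceres Solver's covariance estimation computes; whether this matches equation (14) exactly is an assumption. For simplicity, the coupling residuals are plain weighted depth differences, and the Jacobian, weights and threshold are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def invert(m):
    """Gauss-Jordan inverse with partial pivoting (fine for tiny dense matrices)."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("J^T J is singular: some depth is unconstrained")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Jacobian of four residuals with respect to four depths on one scanline.
jacobian = [
    [1.0, 0.0, 0.0, 0.0],    # depth observation at pixel 0
    [1.0, -1.0, 0.0, 0.0],   # coupling between pixels 0 and 1, weight 1
    [0.0, 1.0, -1.0, 0.0],   # coupling between pixels 1 and 2, weight 1
    [0.0, 0.0, 0.1, -0.1],   # pixel 3 is only weakly coupled (small weight)
]

covariance = invert(matmul(transpose(jacobian), jacobian))
variances = [covariance[i][i] for i in range(len(covariance))]

threshold = 5.0  # hypothetical variance threshold
kept_pixels = [i for i, variance in enumerate(variances) if variance < threshold]
```

The variance grows with the distance to the only observation and jumps for the weakly coupled pixel, which is therefore removed by the threshold. A fully decoupled pixel without any observation would make J^T J singular, which the sketch reports as an error.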
III-F Implementation
Equations (5) and (9) show that the Markov Random Field formulation can be expressed as a nonlinear least-squares problem for which we aim to find the optimal parameters, i.e. the parameters that minimize the overall costs. The problem consists of many residual terms, each of them depending on either one or three parameters. In total, we add one residual term for each depth observation and approximately two residual terms for each pixel if image borders are disregarded. The resulting problem can then be solved by trust-region methods using a linear solver that efficiently exploits the sparse problem structure.
We implemented our Markov Random Field-based upsampling method as a C++ library, which will be publicly available at https://github.com/fzi-forschungszentrum-informatik/mrf. It is based on Ceres Solver [10], an optimization framework used to solve large-scale nonlinear least-squares problems. As residual blocks can be added one by one, Ceres itself exploits the sparse structure and uses state-of-the-art sparse linear solver libraries in its backend. Additionally, parameters can be constrained by lower and upper bounds, which we set to the minimum and maximum depth observed by the range sensor.
IV Applications and Experiments
In section IV-A, we introduce our performance metrics and show evaluation results on a photorealistic RGB-D dataset.
We then present our experimental platform for the mapping of urban environments as an application of guided upsampling and perform a qualitative evaluation in section IV-B.
For both applications, we compare our approach to the model presented in [3] where a constant distance regularization is used.
IV-A Photorealistic Indoor Dataset
We evaluate our approach on a subset of 150 images of the SceneNet RGB-D [11] dataset. It provides RGB-D sensor data from photo-realistic synthetic indoor scenes, together with instance-wise semantic labels and a camera model.
In the dataset, ground-truth depth $d_i^{\mathrm{gt}}$ is available for each estimated depth $d_i$ at all image coordinates $i$ in the image $\mathcal{I}$. Here, we determine the mean
$$\bar{e} = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} e_i \tag{15}$$
and the median of the absolute error $e_i = |d_i - d_i^{\mathrm{gt}}|$. For each evaluation, we also provide the downsampling ratio
$$\rho = \frac{|\mathcal{O}|}{|\mathcal{I}|} \tag{16}$$
which is defined by the number of 3D observations $|\mathcal{O}|$ divided by the image size $|\mathcal{I}|$.
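The evaluation metrics described above are straightforward to compute. The following sketch uses made-up depth values and a hypothetical image size purely for illustration.

```python
from statistics import mean, median

def absolute_errors(estimated, ground_truth):
    """Per-pixel absolute depth error."""
    return [abs(d - g) for d, g in zip(estimated, ground_truth)]

def downsampling_ratio(num_observations, image_width, image_height):
    """Number of 3D observations divided by the number of pixels."""
    return num_observations / (image_width * image_height)

# Made-up depths in meters for four pixels.
estimated = [2.0, 3.5, 5.0, 9.0]
ground_truth = [2.0, 3.0, 5.5, 7.0]

errors = absolute_errors(estimated, ground_truth)
mean_absolute_error = mean(errors)
median_absolute_error = median(errors)
ratio = downsampling_ratio(3072, 640, 480)
```

The single large error raises the mean to 0.75 m while the median stays at 0.5 m, which illustrates why a median below the mean indicates that a few outliers dominate the mean absolute error.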
Figure 4 shows the upsampling quality for the different image features used. We observe that semantic features drastically improve the upsampling quality. Our method achieves a mean absolute error of about 17 mm and an even lower median absolute error if RGB and semantic features are used. Here, the average downsampling rate is 1.
Figure 5 depicts the absolute errors depending on different downsampling ratios, i.e. the sparsity of 3D observations. For our evaluations, we performed equidistant as well as random downsampling. Whereas the mean absolute errors are comparable for a larger number of observations, our approach outperforms the model of [3] when only few observations are available. The reason might be a more realistic regularization in scenes containing a moderate amount of planar surfaces.
IV-B Experimental Vehicle
Figure 6 depicts the upsampling pipeline implemented for our experimental vehicle.
Our platform is equipped with a Velodyne HDL64E-S2 lidar and a high definition RGB camera. The lidar is mounted on top of the vehicle to generate range sensor data structured as a 3D point cloud. RGB images are provided by a Teledyne Dalsa Genie TS-C4096 color camera with an approximate resolution of 12 Megapixels which is mounted externally above the windshield. Camera and laser are triggered at the same rate and the pose between laser scanner and camera can be assumed calibrated to an accuracy of . For one scenario, the projection of laser points into the camera image is depicted in the top image of Figure 1.
Based on the 3D point cloud information provided, we estimate surface normals similar to [12] for each observed point. The method is based on a Principal Component Analysis of all points within a search radius around a query point. The search radius might be adapted depending on the range sensor model. We use the eigenvector corresponding to the smallest eigenvalue as the surface normal of that query point. These surface normals may then be included as pseudo measurements in our guided depth upsampling system. An exemplary normal estimation result is depicted in the top right corner of Figure 7.
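The normal estimation step can be sketched without external libraries. To find the eigenvector of the smallest eigenvalue, the sketch below runs a power iteration on the shifted matrix trace(C) * I - C, whose dominant eigenvector is the one sought; this numerical shortcut, the fixed iteration count and the start vector are choices made for this example, and the neighborhood search within a radius is omitted.

```python
import math

def estimate_normal(points, iterations=200):
    """Unit eigenvector of the smallest eigenvalue of the 3x3 covariance of `points`."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    cov = [[sum((p[a] - centroid[a]) * (p[b] - centroid[b]) for p in points) / n
            for b in range(3)] for a in range(3)]
    trace = cov[0][0] + cov[1][1] + cov[2][2]
    # Eigenvalues of (trace * I - cov) are trace - lambda_k, so its dominant
    # eigenvector is the eigenvector of cov with the smallest eigenvalue.
    shifted = [[(trace if a == b else 0.0) - cov[a][b] for b in range(3)] for a in range(3)]
    start = 1.0 / math.sqrt(3.0)
    v = [start, start, start]  # assumption: start vector is not orthogonal to the normal
    for _ in range(iterations):
        w = [sum(shifted[a][b] * v[b] for b in range(3)) for a in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate neighborhood (all points identical)
        v = [x / norm for x in w]
    return tuple(v)

# Hypothetical neighborhood: nine points on the plane z = 0.5 * x + 2.
neighborhood = [(float(x), float(y), 0.5 * x + 2.0) for x in (-1, 0, 1) for y in (-1, 0, 1)]
normal = estimate_normal(neighborhood)

length = math.sqrt(1.25)
expected_normal = (-0.5 / length, 0.0, 1.0 / length)
alignment = abs(sum(a * b for a, b in zip(normal, expected_normal)))
```

The sign of an eigenvector is arbitrary, so in practice the normal is usually flipped to point towards the sensor; the sketch only checks that the estimated direction is aligned with the true plane normal.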
Our image features
$$\mathbf{f}_i = (r_i, \; g_i, \; b_i, \; c_i)^\top \tag{17}$$
are composed of the RGB value $(r_i, g_i, b_i)$ and a semantic certainty $c_i$. To this end, we predict semantic classes using GoogLeNet [13] adapted as FCN-8s [14]. The network was trained on a 14-class subset of the Cityscapes dataset [15]. Apart from the arg-max class predictions, we utilize the semantic certainty
$$c_i = \frac{\max_k y_{i,k} - \frac{1}{C}}{1 - \frac{1}{C}} \tag{18}$$
of that class, where $C$ is the number of classes. It is computed from the network's softmax output $y_{i,k}$ for class $k$ at pixel $i$, i.e. as the output's improvement over guessing normalized to the maximum possible improvement. For certain predictions this value becomes 1, while at class boundaries it drops towards 0.
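Read literally, the description of the semantic certainty (improvement of the winning softmax score over a uniform guess, normalized to the maximum possible improvement) corresponds to the following small function. The function name and the example score vectors are illustrative.

```python
def semantic_certainty(softmax_scores):
    """Improvement of the best class score over guessing, normalized to [0, 1].

    softmax_scores: per-class probabilities of one pixel (assumed to sum to 1).
    """
    num_classes = len(softmax_scores)
    guess = 1.0 / num_classes
    return (max(softmax_scores) - guess) / (1.0 - guess)

certain = semantic_certainty([1.0, 0.0, 0.0, 0.0])
uniform = semantic_certainty([0.25, 0.25, 0.25, 0.25])
two_way_tie = semantic_certainty([0.5, 0.5, 0.0, 0.0])
```

A one-hot prediction yields a certainty of 1 and a uniform prediction yields 0. A tie between two of four classes, as may occur at a class boundary, gives an intermediate value of one third.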
Finally, pair-wise weights
[Equation (19)]
are calculated, where a scaling function is applied to the difference in RGB space between two neighboring pixels, weighted with the semantic class certainty. The regularization weights as applied in equation (13) are depicted in the top left corner of Figure 7.
Based on regularization weights, camera model and 3D surface point normals, upsampling is performed. The upsampled depth image for this scenario is depicted at the bottom of Figure 1. Using the ray lookup function in equation (1), we can transform this depth image into a 3D point cloud, which is depicted for a shifted viewpoint at the bottom of Figure 7. We observe that, with the soft regularization scaling induced by the semantic certainty, some objects in the scene are not completely separated from the environment. However, our approach accurately estimates planar surfaces such as house fronts or ground surfaces.
V Conclusion
We presented an approach for guided depth upsampling of range sensor data based on a novel regularization term that preserves planar surfaces. We incorporate not only 2D image features into our model but also 3D surface normals. By evaluating surface planarity in the regularization term, we show that our method outperforms state-of-the-art methods that regularize towards constant depths. Finally, we suggest a method to filter ill-conditioned data based on estimating the covariance matrix after optimization. As the upsampling quality is sensitive to calibration and synchronization errors, we would also like to include the transformation between laser and camera in the optimization problem, which might lead to a one-shot extrinsic calibration technique.
References
[1] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, “Joint bilateral upsampling,” ACM Transactions on Graphics, vol. 26, no. 3, p. 96, 2007.
[2] M. Y. Liu, O. Tuzel, and Y. Taguchi, “Joint geodesic upsampling of depth images,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 169–176, 2013.
[3] J. Diebel and S. Thrun, An Application of Markov Random Fields to Range Sensing. MIT Press, 2005.
[4] Q. Chen and V. Koltun, “Fast MRF optimization with application to depth reconstruction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3914–3921.
[5] A. B. Lee, K. S. Pedersen, and D. Mumford, “The complex statistics of high-contrast patches in natural images,” 2001.
[6] N. Schneider, L. Schneider, P. Pinggera, U. Franke, M. Pollefeys, and C. Stiller, “Semantically Guided Depth Upsampling,” GCPR, 2016.
[7] D. Ferstl, C. Reinbacher, R. Ranftl, and H. Bischof, “Image Guided Depth Upsampling using Anisotropic Total Generalized Variation,” 2013.
[8] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, “High quality depth map upsampling for 3D-TOF cameras,” Proceedings of the IEEE International Conference on Computer Vision, pp. 1623–1630, 2011.
