CMRNet: Camera to LiDAR-Map Registration

Daniele Cattaneo; Matteo Vaghi; Augusto Luis Ballardini; Simone; Fontana; Domenico Giorgio Sorrenti; Wolfram Burgard

arXiv:1906.10109·cs.CV·July 12, 2021

CMRNet: Camera to LiDAR-Map Registration

Daniele Cattaneo, Matteo Vaghi, Augusto Luis Ballardini, Simone, Fontana, Domenico Giorgio Sorrenti, Wolfram Burgard

PDF

2 Repos

TL;DR

CMRNet is a real-time CNN-based method that localizes RGB images within a preexisting LiDAR map without prior training in the specific area, demonstrating high accuracy on the KITTI dataset.

Contribution

This paper introduces CMRNet, the first CNN approach to match monocular camera images to a preexisting LiDAR map for localization.

Findings

01

Achieves 0.27m median localization accuracy on KITTI sequence 00.

02

Operates without training in the specific working area.

03

Processes each frame independently without tracking.

Abstract

In this paper we present CMRNet, a realtime approach based on a Convolutional Neural Network to localize an RGB image of a scene in a map built from LiDAR data. Our network is not trained in the working area, i.e. CMRNet does not learn the map. Instead it learns to match an image to the map. We validate our approach on the KITTI dataset, processing each frame independently without any tracking procedure. CMRNet achieves 0.27m and 1.07deg median localization accuracy on the sequence 00 of the odometry dataset, starting from a rough pose estimate displaced up to 3.5m and 17deg. To the best of our knowledge this is the first CNN-based approach that learns to match images from a monocular camera to a given, preexisting 3D LiDAR-map.

Tables3

Table 1. TABLE I: Parameter Estimation

	Occlusion				Error
Backbone	K	Th	Mirroring	Rot. Loss	Transl.	Rot.
Regnet	-	-	✗	$D_{a}$	0.64m	1.67^∘
ResNet18	-	-	✗	$D_{a}$	0.60m	1.59^∘
PWC-Net	11	3.9999	✗	$D_{a}$	0.52m	1.50^∘
PWC-Net	13	3.9999	✗	$D_{a}$	0.51m	1.43^∘
PWC-Net	5	3.0	✗	$D_{a}$	0.47m	1.45^∘
PWC-Net	5	3.0	✓	$D_{a}$	0.46m	${1.36}^{\circ}$
PWC-Net	5	3.0	✓	$L_{1}$	0.46m	2.07^∘

Table 2. TABLE II: Iterative Pose Refinement

	Initial Error Range		Localization Error
	Transl. [m]	Rot. [deg]	Transl. [m]	Rot. [deg]
Iteration 1	[-2, +2]	[ $- 10$ , $+ 10$ ]	0.51	$1.39$
Iteration 2	[-1, +1]	[ $- 2$ , $+ 2$ ]	0.31	$1.09$
Iteration 3	[-0.6, +0.6]	[ $- 2$ , $+ 2$ ]	0.27	1.07

Table 3. TABLE III: Runtime Performances

	Z-Buffer	Occlusion Filter	CMRNet	Total
Time [ms]	8.6	1.4	4.6	14.7 ( $\sim$ 68Hz)

Equations9

H = (R_{(3, 3)} 0_{(1, 3)} T_{(3, 1)} 1) \in SE (3)

H = (R_{(3, 3)} 0_{(1, 3)} T_{(3, 1)} 1) \in SE (3)

p^{i} = K \cdot H_{ini t} \cdot P^{i}

p^{i} = K \cdot H_{ini t} \cdot P^{i}

L (I, D) = L_{t} (I, D) + L_{q} (I, D)

L (I, D) = L_{t} (I, D) + L_{q} (I, D)

L_{q} (I, D)

L_{q} (I, D)

D_{a} (m)

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

CMRNet: Camera to LiDAR-Map Registration

D. Cattaneo1, M. Vaghi1 A. L. Ballardini2, S. Fontana1, D. G. Sorrenti1 W. Burgard3 2The work of A. L.Ballardini has been funded by European Union H2020, under GA Marie Skłodowska-Curie n. 754382 Got Energy. 1Università degli Studi di Milano - Bicocca, Milano, Italy

2Computer Science Department, Universidad de Alcalá, Alcalá de Henares, Spain

3Albert-Ludwigs-Universität Freiburg, Freiburg, Germany

Abstract

In this paper we present CMRNet, a realtime approach based on a Convolutional Neural Network (CNN) to localize an RGB image of a scene in a map built from LiDAR data. Our network is not trained in the working area, *i.e., *CMRNet does not learn the map. Instead it learns to match an image to the map. We validate our approach on the KITTI dataset, processing each frame independently without any tracking procedure. CMRNet achieves 0.27m and 1.07∘ median localization accuracy on the sequence 00 of the odometry dataset, starting from a rough pose estimate displaced up to 3.5m and 17∘. To the best of our knowledge this is the first CNN-based approach that learns to match images from a monocular camera to a given, preexisting 3D LiDAR-map.

©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

I Introduction

Over the past few years, the effectiveness of scene understanding for self-driving cars has substantially increased both for object detection and vehicle navigation [1, 2]. Even though these improvements allowed for more advanced and sophisticated Advanced Driver Assistance Systems (ADAS) and maneuvers, the current state of the art is far from the SAE full-automation level, especially in complex scenarios such as urban areas. Most of these algorithms depend on very accurate localization estimates, which are often hard to obtain using common Global Navigation Satellite Systems, mainly for non-line-of-sight (NLOS) and multipath issues. Moreover, applications that require navigation in indoor areas, *e.g., *valet parking in underground areas, necessarily require complementary approaches.

Different options have been investigated to solve the localization problem, including approaches based on both vision and Light Detection And Ranging (LiDAR); they share the exploitation of an a-priori knowledge of the environment in the localization process [3, 4, 5]. Localization approaches that utilize the same sensor for mapping and localization usually achieve good performances, as the map of the scene is matched to the same kind of data generated by the on-board sensor. However, their application is hampered by the need for a preliminary mapping of the working area, which represents a relevant issue in terms of effort both for building the maps as well as for their maintenance.

On the one hand, some approaches try to perform the localization exploiting standard cartographic maps, such as OpenStreetMap or other topological maps, leveraging the road graph [6] or high-level features such as lane, roundabouts, and intersections [7, 8, 9]. On the other hand, companies in the established market of maps and related services, like *e.g., *HERE or TomTom, are nowadays already developing so-called High Definition maps, which are built using LiDAR sensors [10]. This allows other players in the autonomous cars domain, to focus on the localization task.

HD maps, which are specifically designed to support self-driving vehicles, provide an accurate position of high-level features such as traffic signs, lane markings, etc. as well as a representation of the environment in terms of point clouds, with a density of points usually reaching 0.1m. In the following, we denote as LiDAR-maps the point clouds generated by processing data from LiDARs.

Standard approaches to exploit such maps localize the observer by matching point clouds gathered by the on-board sensor to the LiDAR-map; solutions to this problem are known as point clouds registration algorithms. Currently, these approaches are hampered by the huge cost of LiDAR devices, the de-facto standard for accurate geometric reconstruction.

In contrast, we here propose a novel method for registering an image from an on-board monocular RGB camera to a LiDAR-map of the area. This allows for the exploitation of the forthcoming market of LiDAR-maps embedded into HD maps using only a cheap camera-based sensor suite on the vehicle.

In particular, we propose CMRNet, a CNN-based approach that achieves camera localization with sub-meter accuracy, basing on a rough initial pose estimate. The maps and images used for localization are not necessarily those used during the training of the network. To the best of our knowledge, this is the first work to tackle the localization problem without a localized CNN, *i.e., *a CNN trained in the working area [11]. CMRNet does not learn the map, instead, it learns to match images to the LiDAR-map. Extensive experimental evaluations performed on the KITTI datasets [12] show the feasibility of our approach.

The remainder of the paper is organized as follows: Section II gives a short review of the most similar methods and the last achievements with DNN-based approaches. In Section III we present the details of the proposed system. In Section IV we show the effectiveness of the proposed approach, and Sections V and VI present our conclusions and future work.

II Related work

In the last years, visual localization has been a trending topic in the computer vision community. Although many variations have been proposed, most of them are either based on images gathered from a camera sensor only or exploit some kind of 3-dimensional reconstruction of the environment.

II-A Camera-only approaches

The first category of techniques deals with the 6-DoF estimate of the camera pose using a single image as input. On the one hand, traditional methods face this problem by means of a two-phase procedure that consists of a coarse localization, performed using a place recognition algorithm, followed by a second refining step that allows for a final accurate localization [13, 14]. On the other hand, the latest machine learning techniques, mainly based on deep learning approaches, face this task in a single step. These models are usually trained using a set of images taken from different points of view of the working environment, in which the system performs the localization. One of the most important approaches of this category, which inspired many subsequent works, is PoseNet [11]. It consists in a CNN trained for camera pose regression. Starting from this work, additional improvements have been proposed by introducing new geometric loss functions [15], by exploiting the uncertainty estimation of Bayesian CNNs [16], by including a data augmentation scheme based on synthetic depth information [17], or using the relative pose between two observations in a CNNs pipeline [18]. One of the many works that follow the idea presented in PoseNet is VLocNet++ [19]. Here the authors deal with the visual localization problem using a multi-learning task (MLT) approach. Specifically, they proved that training a CNN for different tasks at the same time yields better localization performances than single task learning. As for today, the literature still sees [19] as the best performing approach on the 7Scenes dataset [20]. Clark et al.[21] developed a CNN that exploits a sequence of images in order to improve the quality of the localization in urban environments. Brachmann et al., instead, integrated a differentiable version of RANSAC within a CNN-based approach in an end-to-end fashion [22, 23]. Another camera-only localization is based on decision forests, which consists of a set of decision trees used for classification or regression problems. For instance, the approach proposed by Shotton et al. [20] exploits RGBD images and regression forests to perform indoor camera localization. The aforementioned techniques, thanks to the generalization capabilities of machine learning approaches, are more robust against challenging scene conditions like lighting variations, occlusions, and repetitive patterns, in comparison with methods based on hand-crafted descriptors, such as SIFT [24], or SURF [25]. However, all these methods cannot perform localization in environments that have not been exploited in the training phase, therefore these regression models need to be retrained for every new place.

II-B Camera and LiDAR-map approaches

The second category of localization techniques leverages existing maps, in order to solve the localization problem. In particular, two classes of approaches have been presented in the literature: geometry-based and projection-based methods. Caselitz et al.[3] proposed a geometry-based method that solves the visual localization problem by comparing a set of 3D points, the point cloud reconstructed from a sequence of images and the existing map. Wolcott et al.[4], instead, developed a projection-based method that uses meshes built from intensity data associated to the 3D points of the maps, projected into an image plane, to perform a comparison with the camera image using the Normalized Mutual Information (NMI) measure. Neubert et al.[5] proposed to use the similarity between depth images generated by synthetic views and the camera image as a score function for a particle filter, in order to localize the camera in indoor scenes.

The main advantage of these techniques is that they can be used in any environment for which a 3D map is available. In this way, they avoid one of the major drawbacks of machine learning approaches for localization, *i.e., *the necessity to train a new model for every specific environment. Despite these remarkable properties, their localization capabilities are still not robust enough in the presence of occlusions, lighting variations, and repetitive scene structures.

The work presented in this paper has been inspired by Schneider et al.[26], which used 3D scans from a LiDAR and RGB images as the input of a novel CNN, RegNet. Their goal was to provide a CNN-based method for calibrating the extrinsic parameters of a camera w.r.t. a LiDAR sensor. Taking inspiration from that work, in this paper we propose a novel approach that has the advantages of both the categories described above. Differently from the aforementioned literature contribution, which exploits the data gathered from a synchronized single activation of a 3D LiDAR and a camera image, the inputs of our approach are a complete 3D LiDAR map of the environment, together with a single image and a rough initial guess of the camera pose. Eventually, the output consists of an accurate 6-DoF camera pose localization. It is worth to notice that having a single LiDAR scan taken at the same time as the image imply that the observed scene is exactly the same. In our case, instead, the 3D map usually depicts a different configuration, *i.e., *road users are not present, making the matching more challenging.

Our approach combines the generalization capabilities of CNNs, with the ability to be used in any environment for which a LiDAR-map is available, without the need to re-train the network.

III Proposed Approach

In this work, we aim at localizing a camera from a single image in a 3D LiDAR-map of an urban environment. We exploit recent developments in deep neural networks for both pose regression [11] and feature matching [27].

The pipeline of our approach is depicted in Fig. 1 and can be summarized as follows. First, we generate a synthesized depth image by projecting the map points into a virtual image plane, positioned at the initial guess of the camera pose. This is done using the intrinsic parameters of the camera. From now on, we will refer to this synthesized depth image as LiDAR-image. The LiDAR-image, together with the RGB image from the camera, are fed into the proposed CMRNet, which regresses the rigid body transformation $H_{out}$ between the two different points of view. From a technical perspective, applying $H_{out}$ to the initial pose $H_{init}$ allows us to obtain the 6-DoF camera localization.

In order to represent a rigid body transformation, we use a $(4,4)$ homogeneous matrix:

[TABLE]

Here, $\bf{R}$ is a $(3,3)$ rotation matrix and T is a $(3,1)$ translation vector, in cartesian coordinates. The rotation matrix is composed of nine elements, but, as it represents a rotation in the space, it only has three degrees of freedom. For this reason, the output of the network in terms of rotations is expressed using quaternions lying on the 3-sphere ( $S^{3}$ ) manifold. On the one hand, even though normalized quaternions have one redundant parameter, they have better properties than Euler angles, *i.e., *gimbal lock avoidance and unique rotational representation (except that conjugate quaternions represent the same rotation). Moreover, they are composed of fewer elements than a rotation matrix, thus being better suited for machine learning regression approaches.

The outputs of the network are then a translation vector $\textbf{T}\in\mathbb{R}^{3}$ and a rotation quaternion $\textbf{q}\in S^{3}$ . For simplicity, we will refer to the output of the network as $H_{out}$ , implying that we convert T and q to the corresponding homogeneous transformation matrix, as necessary.

III-A LiDAR-Image Generation

In order to generate the LiDAR-image for a given initial pose $H_{init}$ , we follow a two-step procedure.

Map Projection. First, we project all the 3D points in the map into a virtual image plane placed at $H_{init}$ , *i.e., *compute the image coordinates $p$ of every 3D point $P$ . This mapping is shown in Equation 2, where $K$ is the camera projection matrix.

[TABLE]

The LiDAR-image is then computed using a z-buffer approach to determine the visibility of points along the same projection line. Since Equation 2 can be computationally expensive for large maps, we perform the projection only for a sub-region cropped around $H_{init}$ , ignoring also points that lay behind the virtual image plane. In Figure 2a is depicted an example of LiDAR-image.

Occlusion Filtering. The projection of a point cloud into an image plane can produce unrealistic depth images. For instance, the projection of occluded points, *e.g., *laying behind a wall, is still possible due to the sparsity nature of point clouds. To avoid this problem, we adopt the point clouds occlusion estimation filter presented in [28]; an example of the effect of this approach is depicted in Figure 2b. For every point $P_{i}$ , we can build a cone, about the projection line towards the camera, that does not intersect any other point. If the cone has an aperture larger than a certain threshold Th, the point $P_{i}$ is marked as visible. From a technical perspective, for each pixel with a non-zero depth $p_{j}$ in the LiDAR-image, we compute the normalized vector $\vec{v}$ from the relative 3D point $P_{j}$ to the pin-hole. Then, for any 3D point $P_{i}$ whose projection lays in a neighborhood (of size KxK) of $p_{j}$ , we compute the vector ${\vec{c}=\frac{P_{i}-P_{j}}{\|P_{i}-P_{j}\|}}$ and the angle between the two vectors ${\vartheta=\arccos(\vec{v}\cdot\vec{c})}$ . This angle is used to assess the visibility of $P_{j}$ . Occluded pixels are then set to zero in the LiDAR-image. More detail is available in [28]

III-B Network Architecture

PWC-Net [27] was used as baseline, and we then made some changes to its architecture. We chose this network because PWC-Net has been designed to predict the optical flow between a pair of images, *i.e., *to find matches between them. Starting from a rough camera localization estimate, our insight is to exploit the correlation layer of PWC-Net and its ability to match features from different points of view to regress the correct 6-DoF camera pose.

We applied the following changes to the original architecture.

•

First, as our inputs are a depth and an RGB image (instead of two RGB images), we decoupled the feature pyramid extractors by removing the weights sharing.

•

Then, as we aim to perform pose regression, we removed the up-sampling layers, attaching the fully connected layers just after the first cost volume layer.

Regarding the regression part, we added one fully connected layer with 512 neurons before the first optical flow estimation layer (conv6_4 in PWC-Net), followed by two branches for handling rotations and translations. Each branch is composed of two stacked fully connected layers, the first with 256 while the second with 3 or 4 neurons, for translation and rotation respectively.

Given an input pair composed of a RGB image $\mathcal{I}$ and a LiDAR-image $\mathcal{D}$ , we used the following loss function in Equation 3, where $\mathcal{L}_{t}(\mathcal{I},\mathcal{D})$ is the translation loss and $\mathcal{L}_{q}(\mathcal{I},\mathcal{D})$ is the rotation loss.

[TABLE]

For the translation we used a $\mbox{smooth}_{L1}$ loss [29]. Regarding the rotation loss, since the Euclidean distance does not provide a significant measure to describe the difference between two orientations, we used the angular distance between quaternions, as defined below:

[TABLE]

Here, $q$ is the ground truth rotation, $\tilde{q}$ represents the predicted normalized rotation, $inv$ is the inverse operation for quaternions, $\{a_{m},b_{m},c_{m},d_{m}\}$ are the components of the quaternion $m$ and $*$ is the multiplicative operation of two quaternions.

In order to use Equation (5) as a loss function, we need to ensure that it is differentiable for every possible output of the network. Recalling that $atan2(y,x)$ is not differentiable for ${y=0\land x\leq 0}$ , and the fact that $m$ is a unit quaternion, we can easily verify that Equation (5) is differentiable in $S^{3}$ .

III-C Iterative refinement

When the initial pose strongly deviates with respect to the camera frame, the map projection produces a LiDAR-image that shares just a few correspondences with the camera image. In this case, the camera pose prediction task is hard, because the CNN lacks the required information to compare the two points of view. It is therefore quite likely that the predicted camera pose is not accurate enough. Taking inspiration from [26], we propose an iterative refinement approach. In particular, we trained different CNNs by considering descending error ranges for both the translation and rotation components of the initial pose. Once a LiDAR-image is obtained for a given camera pose, both the camera and the LiDAR-image are processed, starting from the CNN that has been trained with the largest error range. Then, a new projection of the map points is performed, and the process is repeated using a CNN trained with a reduced error range. Repeating this operation n times is possible to improve the accuracy of the final localization. The improvement is achieved thanks to the increasing overlap between the scene observed from the camera and the scene projected in the $n^{th}$ LiDAR-image.

III-D Training details

We implemented CMRNet using the PyTorch library [30], and a slightly modified version of the official PWC-Net implementation. Regarding the activation function, we used a leaky RELU (REctified Linear Unit) with a negative slope of $0.1$ as non-linearity. Finally, CMRNet was trained from scratch for 300 epochs using the ADAM optimizer with default parameters, a batch size of $24$ and a learning rate of $1e^{-4}$ on a single NVidia GTX 1080ti.

IV Experimental results

This section describes the evaluation procedure we adopted to validate CMRNet, including the used dataset, the assessed system components, the iterative refinements and finally the generalization capabilities.

We wish to emphasize that, in order to assess the performance of CMRNet itself, in all the performed experiments each input was processed independently, *i.e., *without any tracking or temporal integration strategy.

IV-A Dataset

We tested the localization accuracy of our method on the KITTI odometry dataset. Specifically, we used the sequences from 03 to 09 for training (11697 frames) and the sequence 00 for validating (4541 frames). Note that the validation set is spatially separated from the train set, except for a very small sub-sequence (approx 200 frames), thus it is fair to say that the network is tested in scenes never seen during the training phase. Since the accuracy of the provided GPS-RTK ground truth is not sufficient for our task (the resulting map is not aligned nearby loop closures), we used a LiDAR-based SLAM system to obtain consistent trajectories. The resulting poses are used to generate a down-sampled map with a resolution of 0.1m. This choice is the result of our expectations on the format of HD-maps that will be soon available from map providers [10].

Since the images from the KITTI dataset have different sizes (varying from 1224x370 to 1242x376), we padded all images to 1280x384, in order to match the CNN architecture requirement, *i.e., *width and height multiple of 64. Note that we first projected the map points into the LiDAR-image and then we padded both RGB and LiDAR-image, in order not to modify the camera projection parameters.

To simulate a noisy initial pose estimate $H_{init}$ , we applied, independently for each input, a random translation, and rotation to the ground truth camera pose. In particular, for each component, we added a uniformly distributed noise in the range of [-2m, +2m] for the translation and [ $-10^{\circ}$ , $+10^{\circ}$ ] for the rotation.

Finally, we applied the following data augmentation scheme: first, we randomly changed the image brightness, contrast and saturation (all in the range [0.9, 1.1]). Then we randomly mirrored the image horizontally, and last we applied a random image rotation in the range [ $-5^{\circ}$ , $+5^{\circ}$ ] along the optical axis. The 3D point cloud was transformed accordingly.

Both data augmentation and the selection of $H_{init}$ take place at run-time, leading to different LiDAR-images for the same RGB image across epochs.

IV-B System Components Evaluation

We evaluated the performances of CMRNet by assessing the localization accuracy, varying different sub-components of the overall system. Among them, the most significative are shown in Table I, and derive from the following operational workflow.

First, we evaluated the best CNN to be used as backbone, comparing the performances of state-of-the-art approaches, namely PWC-Net, ResNet18 and RegNet [27, 31, 26]. According to the performed experiments, PWC-Net maintained a remarkable superiority with respect to RegNet and ResNet18 and therefore was chosen as a starting point for further evaluation.

Thereafter, we estimated the effects in modifying both inputs, *i.e., *camera images and LiDAR-images. In particular, we added a random image mirroring and experimented different parameter values influencing the effect of the occlusion filtering presented in Section III-A, *i.e., *size K and threshold Th.

At last, the effectiveness of the rotation loss proposed in Equation 5 was evaluated with respect to the commonly used $L_{1}$ loss. The proposed loss function achieved a relative decrease of rotation error of approx. 35%.

The noise added to the poses in the validation set was kept fixed on all the experiments, allowing for a fair comparison of the performances.

IV-C Iterative Refinement and Overall Assessment

In order to improve the localization accuracy of our system, we tested the iterative approach explained in Section III-C. In particular, we trained three instances of CMRNet varying the maximum error ranges of the initial camera poses. To assess the robustness of CMRNet, we repeated the localization process for 10 times using different initial noises. The averaged results are shown in Table II together with the correspondent ranges used for training each network.

Moreover, in order to compare the localization performances with the state-of-the-art monocular localization in LiDAR maps [3], we calculated mean and standard deviation for both rotation and translation components over 10 runs on the sequence 00 of the KITTI odometry dataset. Our approach shows comparable values for the translation component ( $0.33\pm 0.22$ m w.r.t. $0.30\pm 0.11$ m), with a lower rotation errors ( $1.07\pm 0.77^{\circ}$ w.r.t. $1.65\pm 0.91^{\circ}$ ). Nevertheless, it is worth to note that our approach still does not take advantage of any pose tracking procedure nor multi-frame analysis.

Some qualitative examples of the localization capabilities of CMRNet with the aforementioned iteration scheme are depicted in Figure 3.

In Figure 4 we illustrate the probability density functions (PDF) of the error, decomposed into the six components of the pose, for the three iterations of the aforementioned refinement. It can be noted that the PDF of even the first network iteration approximates a Gaussian distribution and following iterations further decrease the variance of the distributions.

An analysis of the runtime performances using this configuration is shown in Table III.

IV-D Generalization Capabilities

In order to assess the generalization effectiveness of our approach, we evaluated its localization performance using a 3D LiDAR-map generated on a different day with respect to the camera images, yet still of the same environment. This allows us to have a completely different arrangement of parked cars and therefore to stress the localization capabilities.

Unfortunately, there is only a short overlap between the sequences of the odometry dataset (approx. 200 frames), consisting of a small stretch of roads in common between sequences ”00” and ”07”. Even though we cannot completely rely on the results of this limited set of frames, CMRNet achieved 0.57m and 0.9∘ median localization accuracy on this test.

Indeed, it is worth to notice that the network was trained with maps representing the same exact scene of the respective images, *i.e., *with cars parked in the same parking spots, and thus cannot learn to ignore cluttering scene elements.

V Conclusions

In this work we have described CMRNet, a CNN based approach for camera to LiDAR-Map registration, using the KITTI dataset for both learning and validation purposes. The performances of the proposed approach allow multiple specialized CMRNet to be stacked as to improve the final camera localization, yet preserving realtime requirements. The results have shown that our proposal is able to localize the camera with a median of less than $0.27$ m and $1.07^{\circ}$ . Preliminary and not reported experiments on other datasets suggests there is room for improvement and the reason seems to be due to the limited vertical field-of-view available for the point clouds.Since our method does not learn the map but learn how to perform the registration, it is suitable for being used with large-scale HD-Maps.

VI Future Works

Even though our approach does not embed any information of specific maps, a dependency on the intrinsic camera calibration parameters still holds. As part of the future works we plan to increase the generalization capabilities so to not directly depend from a specific camera calibration. Finally, since the error distributions reveal a similarity with respect to Gaussian distributions, we expect to be able to benefit from standard filtering techniques aimed to probabilistically tackle the uncertainties over time.

Acknowledgments

The authors would like to thank Tim Caselitz for his contribution related to the ground truth SLAM-based trajectories for the KITTI sequences and Pietro Colombo for the help in the editing of the associated video.

Bibliography31

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] V. Vaquero, I. del Pino, F. Moreno-Noguer, J. Solà, A. Sanfeliu, and J. Andrade-Cetto, “Deconvolutional networks for point-cloud vehicle detection and tracking in driving scenarios,” in 2017 European Conference on Mobile Robots (ECMR) , 2017, pp. 1–7.
2[2] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object detection network for autonomous driving,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , July 2017.
3[3] T. Caselitz, B. Steder, M. Ruhnke, and W. Burgard, “Monocular camera localization in 3d lidar maps,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , 2016, pp. 1926–1931.
4[4] R. W. Wolcott and R. M. Eustice, “Visual localization within lidar maps for automated urban driving,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems , 2014, pp. 176–183.
5[5] P. Neubert, S. Schubert, and P. Protzel, “Sampling-based methods for visual navigation in 3d maps by synthesizing depth images,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , 2017, pp. 2492–2498.
6[6] I. P. Alonso, D. F. Llorca, M. Gavilan, S. A. Pardo, M. A. Garcia-Garrido, L. Vlacic, and M. A. Sotelo, “Accurate global localization using visual odometry and digital maps on urban environments,” IEEE Trans. Intell. Transp. Syst. , Dec 2012.
7[7] G. Cao, F. Damerow, B. Flade, M. Helmling, and J. Eggert, “Camera to map alignment for accurate low-cost lane-level scene interpretation,” Nov 2016.
8[8] M. Raaijmakers and M. E. Bouzouraa, “In-vehicle roundabout perception supported by a priori map data,” Sept 2015.