TL;DR
This paper introduces a deep generative model for LiDAR data that produces high-quality, structured 2D point maps, improving over existing methods and robustly handling noisy inputs by leveraging a novel data representation.
Contribution
It adapts deep generative models to LiDAR scan synthesis by unravelling scans into 2D maps and proposes a new data representation that enhances robustness and quality.
Findings
Significant improvements over state-of-the-art point cloud generation methods.
The model can recover LiDAR scans from noisy or incomplete data.
The proposed data representation improves robustness to input noise.
| Model | EMD | Chamfer |
|---|---|---|
| Random | 4331.9 | 253.6 |
| AtlasNet | 1571.2 | 2.85 |
| Achlioptas et al. [16] | 1103.1 | 2.16 |
| Ours (xyz) | 137.2 | 1.23 |
| Ours (pol) | 127.0 | 1.04 |
Deep Generative Modeling of LiDAR Data
Lucas Caccia (1,2), Herke van Hoof (1,4), Aaron Courville (2,3), Joelle Pineau (1,2,3)
(1) MILA, McGill University; (2) MILA, Université de Montréal; (3) CIFAR Fellow; (4) University of Amsterdam
Abstract
Building models capable of generating structured output is a key challenge for AI and robotics. While generative models have been explored on many types of data, little work has been done on synthesizing lidar scans, which play a key role in robot mapping and localization. In this work, we show that one can adapt deep generative models for this task by unravelling lidar scans into a 2D point map. Our approach can generate high quality samples, while simultaneously learning a meaningful latent representation of the data. We demonstrate significant improvements against state-of-the-art point cloud generation methods. Furthermore, we propose a novel data representation that augments the 2D signal with absolute positional information. We show that this helps robustness to noisy and imputed input; the learned model can recover the underlying lidar scan from seemingly uninformative data.
I Introduction
One of the main challenges in mobile robotics is the development of systems capable of fully understanding their environment. This non-trivial task becomes even more complex when sensor data is noisy or missing. An intelligent system that can replicate the data generation process is much better equipped to tackle inconsistency in its sensor data. There is significant potential gain in having autonomous robots equipped with data generation capabilities which can be leveraged for reconstruction, compression, or prediction of the data stream.
In autonomous driving, information from the environment is captured from sensors mounted on the vehicle, such as cameras, radars, and lidars. While a significant amount of research has been done on generating RGB images, relatively little work has focused on generating lidar data. These scans, represented as an array of three dimensional coordinates, give an explicit topography of the vehicle’s surroundings, potentially leading to better obstacle avoidance, path planning, and inter-vehicle spatial awareness.
To this end, we leverage recent advances in deep generative modeling, namely variational autoencoders (VAE) [1] and generative adversarial networks (GAN) [2], to produce a generative model of lidar data. While the VAE and GAN approaches have different objectives, they can be used in conjunction with Convolutional Neural Networks (CNN) [3] to extract local information from nearby sensor points.
Unlike some approaches for lidar processing, we do not convert the data to voxel grids [4, 5]. Instead, we build off existing work [6] which projects the lidar scan into a 2D spherical point map. We show that this representation is fully compatible with deep architectures previously designed for image generation. Moreover, we investigate the robustness of this approach to missing or noisy data, a crucial property for real world applications. We propose a simple, yet effective way to improve the model’s performance when the input is degraded. Our approach consists of augmenting the 2D map with absolute positional information, through extra coordinate channels. We validate these claims through a variety of experiments on the KITTI [7] dataset.
Our contributions are the following:
- We provide a fully unsupervised method for both conditional and unconditional lidar generation.
- We establish an evaluation framework for lidar reconstruction, allowing the comparison of methods over a spectrum of different corruption mechanisms.
- We propose a simple technique to help the model process noisy or missing data.
II Related work
II-A Lidar processing using Deep Learning
The majority of papers applying deep learning methods to lidar data present discriminative models to extract relevant information from the vehicle’s environment. Dewan et al. [8] propose a CNN for pointwise semantic segmentation to distinguish between static and moving obstacles. Caltagirone et al. [9] use a similar approach to perform pixel-wise classification for road detection. To leverage the full 3D structure of the input, Bo Li [10] uses 3D convolutions on a voxel grid for vehicle detection. However, processing voxels is computationally heavy and does not leverage the sparsity of lidar scans. Engelcke et al. [5] propose an efficient 3D convolutional layer to mitigate these issues.
Another popular approach [6, 11, 12, 13] to avoid using voxels relies on the inherent two-dimensional nature of lidars. It consists of a bijective mapping from 3D point cloud to a 2D point map, where coordinates are encoded as azimuth and elevation angles measured from the origin. This can also be seen as projecting the point cloud onto a 2D spherical plane. Using such a bijection lies at the core of our proposed approach for generative modeling of lidar data.
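As a minimal sketch of this mapping, the snippet below converts a Cartesian point into azimuth and elevation angles plus range, and back again. The function names and the axis convention (z pointing up, azimuth measured in the x-y plane) are assumptions made for illustration; they are not taken from the paper or from [6].

```python
import math

def to_spherical(x, y, z):
    """Map a 3D point to (azimuth, elevation, range), angles in radians."""
    r = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)                    # angle in the horizontal plane
    elevation = math.atan2(z, math.hypot(x, y))   # angle above the horizontal plane
    return azimuth, elevation, r

def to_cartesian(azimuth, elevation, r):
    """Inverse mapping: with the range kept, the original point is recovered."""
    horizontal = r * math.cos(elevation)
    return (horizontal * math.cos(azimuth),
            horizontal * math.sin(azimuth),
            r * math.sin(elevation))

point = (3.0, 4.0, 1.0)
recovered = to_cartesian(*to_spherical(*point))
```

The round trip illustrates why the mapping can be treated as a bijection: the two angles select a cell of the 2D map, and the value stored in that cell carries the remaining information.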
II-B Grid-based lidar generation
An alternative approach for generative modeling of lidar data is from Ondrúška et al. [14]. They train a Recurrent Neural Network for semantic segmentation and convert their input to an occupancy grid. More relevant to our task, they train their network to also predict future occupancy grids, thereby creating a generative model for lidar data. Their approach differs from ours, as the occupancy grid used assigns a constant area (400 cm²) to every slot, whereas we operate directly on projected coordinates. This not only reduces preprocessing time, but also allows us to efficiently represent data with non-uniform spatial density. We can therefore run our model at a much higher resolution, while remaining computationally efficient.
Concurrent with our work, Tomasello et al. [15] explore conditional lidar synthesis from RGB images. The authors use the same 2D spherical mapping proposed in [6]. Our approach differs on several points. First, we do not require any RGB input for generation, which may not always be available (e.g. in poorly lit environments). Second, we explore ways to augment the lidar representation to increase robustness to corrupted data. Finally, we look at generative modeling of lidar data (compared to a deterministic mapping in their case).
II-C Point Cloud Generation
A recent line of work [16, 17, 18, 19] considers the problem of generating point clouds as unordered sets of coordinates. This approach does not define an ordering on the points, and must therefore be invariant to permutations. To achieve this, they use a variant of PointNet [20] to encode a variable-length point cloud into a fixed-length representation. This latent vector is then decoded back to a point cloud, and the whole network is trained using permutation invariant losses such as the Earth-Mover’s Distance or the Chamfer Distance [19]. While these approaches work well for arbitrary point clouds, we show that they give suboptimal performance on lidar, as they do not leverage the known structure of the data.
II-D Improving representations through extra coordinate channels
In this work, we propose to augment the 2D spherical signal with Cartesian coordinates. This can be seen as a generalization of the CoordConv solution [21], whose authors propose to add two channels to the image input, corresponding to the location of every pixel. They show that this enables networks to learn either complete translation invariance or varying degrees of translation dependence, leading to better performance on a variety of downstream tasks.
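The idea of extra coordinate channels can be sketched as follows. The helper `add_coord_channels` is a hypothetical name, and rescaling the row and column indices to [-1, 1] is an assumption of this sketch rather than a detail stated in the text.

```python
def add_coord_channels(image):
    """Append normalized (row, col) coordinates to every pixel.

    image: H x W nested list where each pixel is a list of channel values.
    Returns a new H x W nested list with two extra channels per pixel.
    """
    height, width = len(image), len(image[0])
    out = []
    for i, row in enumerate(image):
        yi = 2.0 * i / (height - 1) - 1.0 if height > 1 else 0.0
        new_row = []
        for j, pixel in enumerate(row):
            xj = 2.0 * j / (width - 1) - 1.0 if width > 1 else 0.0
            new_row.append(list(pixel) + [yi, xj])
        out.append(new_row)
    return out

image = [[[0.5] for _ in range(3)] for _ in range(2)]   # 2 x 3 image, 1 channel
augmented = add_coord_channels(image)
```

A convolution applied to `augmented` sees where each pixel sits in the grid, which a convolution on the raw image cannot.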
III Technical Background : Generative Modeling
The underlying task of generative models is density estimation. Formally, we are given a set of $d$-dimensional i.i.d. samples $X = \{x_i \in \mathbb{R}^d\}_{i=1}^{m}$ from some unknown probability density function $p_{\text{real}}$. Our objective is to learn a density $p_\theta$, where $\theta$ represents the parameters of our estimator and $\{p_\theta \mid \theta \in \Theta\}$ a parametric family of models. Training is done by minimizing some distance $\mathcal{D}$ between $p_{\text{real}}$ and $p_\theta$. The choice of both $\mathcal{D}$ and the training algorithm are the defining components of the density estimation procedure. Common choices for $\mathcal{D}$ are either $f$-divergences such as the Kullback-Leibler (KL) divergence, or Integral Probability Metrics (IPMs), such as the Wasserstein metric [22]. These similarity metrics between distributions often come with specific training algorithms, as we describe next.
III-A Maximum Likelihood Training
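Maximum likelihood estimation (MLE) aims to find model parameters $\theta^*$ that maximize the likelihood of the samples $X = \{x_1, \dots, x_m\}$. Since samples are i.i.d., the optimization criterion can be written as:

$$
\theta^* = \arg\max_{\theta \in \Theta} \prod_{i=1}^{m} p_\theta(x_i) = \arg\max_{\theta \in \Theta} \sum_{i=1}^{m} \log p_\theta(x_i) \tag{1}
$$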
It can be shown that training with the MLE criterion converges to a minimization of the KL divergence as the sample size increases [23]. From Eqn. (1) we see that any model admitting a differentiable density can be trained via backpropagation. Powerful generative models trained via MLE include Variational Autoencoders [1] and autoregressive models [24]. In this work, we focus on the former, as the latter have slow sampling speed, limiting their potential use for real-world applications.
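One way to see the connection between MLE and the KL divergence, writing $p_{\text{real}}$ for the data density and $p_\theta$ for the model (notation assumed here for illustration), is to expand the divergence:

$$
\mathrm{KL}(p_{\text{real}} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{\text{real}}}\left[\log p_{\text{real}}(x)\right] - \mathbb{E}_{x \sim p_{\text{real}}}\left[\log p_\theta(x)\right]
$$

The first term does not depend on $\theta$, so minimizing the divergence amounts to maximizing the expected log-likelihood, which the sample average $\frac{1}{m}\sum_{i=1}^{m} \log p_\theta(x_i)$ approximates as the number of samples $m$ grows.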
III-A1 Variational Autoencoders (VAE)
The VAE [1] is a regularized version of the traditional autoencoder (AE). It consists of two parts: an inference network $q_\phi(z \mid x)$ that maps an input $x$ to a posterior distribution of latent codes $z$, and a generative network $p_\theta(x \mid z)$ that aims to reconstruct the original input conditioned on the latent encoding. By imposing a prior distribution $p(z)$ on latent codes, it enforces the distribution over $z$ to be smooth and well-behaved. This property enables proper sampling from the model via ancestral sampling from latent to input space.
The full objective of the VAE is then:

$$
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \tag{2}
$$
which is a valid lower bound on the true likelihood, thereby making Variational Autoencoders valid generative models. For a more in depth analysis of VAEs, see [25].
III-B Generative Adversarial Network (GAN)
The GAN [2] formulates the density estimation problem as a minimax game between two opposing networks. The generator $G$ maps noise $z$ drawn from a prior distribution $p(z)$ to the input space, aiming to fool its adversary, the discriminator $D$. The latter then tries to distinguish between real samples $x \sim p_{\text{real}}$ and fake samples $x' = G(z)$. In practice, both models are represented as neural networks. Formally, the objective is written as

$$
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{real}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{3}
$$
GANs have shown the ability to produce more realistic samples [26] than their MLE counterparts. However, the optimization process is notoriously difficult; stabilizing GAN training is still an open problem. In practice, GANs can also suffer from mode collapse [27], which happens when the generator overlooks certain modes of the target distribution.
IV Proposed approach for lidar generation
We next describe the proposed deep learning framework used for generative modeling of lidar scans.
IV-A Data Representation
Our approach relies heavily on 2D convolutions, therefore we start by converting a lidar scan containing $N$ $(x, y, z)$ coordinates into a 2D grid. We begin by clustering together points emitted from the same elevation angle into $H$ clusters. Second, for every cluster, we sort the points in increasing order of azimuth angle. In order to have a proper grid with a fixed number of points per row, we divide the $360^\circ$ plane into $W$ bins. This yields an $H \times W$ grid, where for each cell we store the average $(x, y, z)$ coordinate, such that we can store all the information in an $H \times W \times 3$ tensor. We note that the default ordering in most lidar scanners is the same as the one obtained after applying this preprocessing. Therefore, sorting is not required in practice, and the whole procedure can be executed in $O(N)$. Figure 2 provides a visual representation of this mapping. This procedure yields the same ordering of points as the projection discussed in II-A. The latter would then return a grid of $H \times W \times 2$, where the $(x, y)$ channels are compressed as $d = \sqrt{x^2 + y^2}$. We will refer to the two representations above as Cartesian and Polar respectively. While this small change in representation seems innocuous, we show that when the input is noisy or incomplete, this compression can lead to suboptimal performance.
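The preprocessing can be sketched in a few lines. This is an illustrative version under stated assumptions, not the authors' implementation: rows are obtained by uniformly binning the elevation angle between `elev_min` and `elev_max` (a real sensor has fixed laser angles), empty cells are filled with zeros, and the Polar variant is assumed to keep the horizontal distance and the height. All function and parameter names are hypothetical.

```python
import math

def scan_to_grid(points, height, width, elev_min, elev_max):
    """Bin (x, y, z) points into a height x width grid of mean coordinates.

    Rows index elevation angle, columns index azimuth angle. Empty cells
    are set to [0.0, 0.0, 0.0], an arbitrary choice for this sketch.
    """
    sums = [[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for x, y, z in points:
        elevation = math.atan2(z, math.hypot(x, y))
        azimuth = math.atan2(y, x) % (2.0 * math.pi)        # in [0, 2*pi)
        row = int((elevation - elev_min) / (elev_max - elev_min) * height)
        col = int(azimuth / (2.0 * math.pi) * width)
        row = min(max(row, 0), height - 1)                   # clamp edge cases
        col = min(col, width - 1)
        for k, value in enumerate((x, y, z)):
            sums[row][col][k] += value
        counts[row][col] += 1
    return [[[s / counts[r][c] if counts[r][c] else 0.0 for s in sums[r][c]]
             for c in range(width)] for r in range(height)]

def cartesian_to_polar(grid):
    """Compress the (x, y) channels into d = sqrt(x^2 + y^2), keeping z."""
    return [[[math.hypot(x, y), z] for x, y, z in row] for row in grid]

points = [(2.0, 1.0, 0.0), (4.0, 1.0, 0.0), (-1.0, 1.0, 0.0)]
grid = scan_to_grid(points, height=1, width=4, elev_min=-0.5, elev_max=0.5)
polar = cartesian_to_polar(grid)
```

In the example, the first two points fall in the same azimuth bin and are averaged, while the third lands in the next bin. The Polar grid drops the direction of each point within its cell, which is exactly the information the Cartesian channels retain.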
IV-B Training Phase
IV-B1 VAEs
In practice, both encoder and decoder are represented as neural networks with parameters $\phi$ and $\theta$ respectively.
Similar to a traditional AE, the training procedure first encodes the data $x$ into a latent representation $z$. The variational aspect is introduced by interpreting $z$ not as a vector, but as parameters of a posterior distribution. In our work we choose a Gaussian prior and posterior, and therefore $z$ decomposes as $(\mu_x, \sigma_x)$.
We then sample $\tilde{z} \sim \mathcal{N}(\mu_x, \sigma_x^2)$ from this distribution and pass it through the decoder to obtain $\tilde{x}$. Using the reparametrization trick [1], the network is fully deterministic and differentiable w.r.t. its parameters $\phi$ and $\theta$, which are updated via stochastic gradient descent (SGD).
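The sampling step and the regularization term of a Gaussian VAE can be sketched as below. The log-variance parametrization and the standard normal prior are assumptions of this sketch (the text only states that prior and posterior are Gaussian), and the function names are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per dimension.

    The randomness lives in eps only, so z is a deterministic,
    differentiable function of mu and log_var.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, 0.0]
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

The KL term vanishes only when the posterior equals the prior, which is what pulls the latent codes toward a smooth, sampleable distribution.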
IV-B2 GANs
Training alternates between updates for the generator and discriminator, with parameters $\theta_G$ and $\theta_D$. Similarly to the VAE, samples are obtained by ancestral sampling from the prior through the generator. In the original GAN, the networks are updated according to Eqn. (3). In practice, we use the Relativistic Average GAN (RaGAN) objective [28], which is easier to optimize. Again, $\theta_G$ and $\theta_D$ are updated using SGD. For a complete hyperparameter list, we refer the reader to our publicly available source code (https://www.github.com/pclucas14/lidar_generation).
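For readers unfamiliar with the relativistic average objective, the following sketch computes both losses from raw (pre-sigmoid) discriminator outputs. It follows the commonly cited form of RaGAN; whether the authors' code uses exactly this variant is not stated here, and the function names are hypothetical.

```python
import math

def _sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def _mean(values):
    return sum(values) / len(values)

def ragan_losses(real_scores, fake_scores):
    """Return (discriminator_loss, generator_loss) from raw critic outputs."""
    mean_real, mean_fake = _mean(real_scores), _mean(fake_scores)
    # Each score is judged relative to the average score of the opposite set.
    real_rel = [_sigmoid(s - mean_fake) for s in real_scores]
    fake_rel = [_sigmoid(s - mean_real) for s in fake_scores]
    d_loss = (-_mean([math.log(p) for p in real_rel])
              - _mean([math.log(1.0 - p) for p in fake_rel]))
    g_loss = (-_mean([math.log(p) for p in fake_rel])
              - _mean([math.log(1.0 - p) for p in real_rel]))
    return d_loss, g_loss

d_equal, g_equal = ragan_losses([0.0, 0.0], [0.0, 0.0])
d_sep, g_sep = ragan_losses([2.0, 2.0], [-2.0, -2.0])
```

Unlike the original objective, the generator loss here also depends on the real samples, so the generator receives a gradient signal even when the discriminator separates the two sets well.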
IV-C Model Architecture
Deep Convolutional GANs (DCGANs) [29] have shown great success in generating images. They use a symmetric architecture for the two networks: The generator consists of five transpose convolutions with stride two to upsample at each layer, and ReLU activations. The discriminator uses stride two convolutions to downsample the input, and Leaky ReLU activations. In both networks, Batch Normalization [30] is interleaved between convolution layers for easier optimization. We use this architecture for all our models: The VAE encoder setup is simply the first four layers of the discriminator, and the decoder’s architecture replicates the DCGAN generator. We note that for both models, more sophisticated architectures [31, 32] are fully compatible with our framework. We leave this line of exploration as future work.
V Experiments
This section provides a thorough analysis of the performance of our framework fulfilling a variety of tasks related to generative modeling. First, we explore conditional generation, where the model must compress and reconstruct a (potentially corrupted) lidar scan. We then look at unconditional generation. In this setting, we are only interested in producing realistic samples, which are not explicitly tied to a real lidar cloud.
V-A Dataset
We consider the point clouds available in the KITTI dataset [7]. We use the train/validation/test set split proposed by [33], which yields 40 000, 80 and 700 samples for train, validation and test sets. We use the preprocessing described in section IV-A to get a grid. For training we subsample from 10 Hz to 3 Hz since temporally adjacent frames are nearly identical.
V-B Baseline Models
Since, to the best of our knowledge, no work has attempted generative modeling of raw lidar clouds, we compare our method to models that operate on arbitrary point clouds. We first choose AtlasNet [17], which has shown strong modeling performance on the ShapeNet [34] dataset. This network first encodes point clouds using a shared MLP network that operates on each point individually. A max-pooling operation is performed on the point axis to obtain a fixed-length global representation of the point cloud. In other words, the encoder treats each point independently of other points, without assuming an ordering on the set of coordinates. This makes the feature extraction process invariant to permutations of points. The decoder is given the encoder output along with coordinates of a 2D grid, and attempts to fold this 2D grid into a three-dimensional surface. The decoder also uses an MLP network shared across all points.
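The permutation invariance of such an encoder can be checked on a toy example. The sketch below uses a single shared linear layer with ReLU followed by max-pooling; the weights are arbitrary illustrative values, and this is a simplification of the idea rather than the AtlasNet encoder itself.

```python
def encode_point_set(points, weights):
    """Apply one shared linear + ReLU layer to every point, then max-pool."""
    features = []
    for point in points:
        features.append([max(0.0, sum(w * c for w, c in zip(row, point)))
                         for row in weights])
    # Max over the point axis: fixed-length output, independent of point order.
    return [max(column) for column in zip(*features)]

# Hypothetical 4 x 3 weight matrix mapping each point to 4 features.
weights = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.5], [0.3, 0.3, 0.3], [-1.0, 1.0, 0.0]]
cloud = [(1.0, 2.0, 3.0), (-1.0, 0.5, 0.0), (4.0, -2.0, 1.0)]
code = encode_point_set(cloud, weights)
shuffled_code = encode_point_set(list(reversed(cloud)), weights)
```

Both orderings give the same code. The flip side is that the encoder has no access to which points are neighbors, which matters for structured data such as lidar scans.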
We also compare our model with the one from Achlioptas et al. [16]. Only its decoder differs from AtlasNet: the model does not deform a 2D grid, but rather uses fully-connected layers to convert the latent vector into a point cloud, making it less parameter efficient.
Both networks are trained end-to-end using the Chamfer Loss [19], defined as

$$
d_{CH}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2 \tag{4}
$$

where $S_1$ and $S_2$ are two sets of coordinates. We note again that this loss is invariant to the ordering of the output points. For both autoencoders, we regularize their latent space using a Gaussian prior to get a valid generative model.
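A direct, quadratic-time sketch of the Chamfer Distance is shown below, assuming the summed squared-Euclidean form. The function name is hypothetical, and practical implementations would vectorize the nearest-neighbor search.

```python
def chamfer_distance(s1, s2):
    """Sum of squared nearest-neighbor distances, in both directions."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    forward = sum(min(sq_dist(p, q) for q in s2) for p in s1)
    backward = sum(min(sq_dist(p, q) for p in s1) for q in s2)
    return forward + backward

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
distance = chamfer_distance(a, b)
```

Each point is matched to its nearest neighbor independently, so several points may share the same match. This is what makes the loss cheap, and also what lets it ignore differences in point density.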
V-C Conditional Generation
We proceed to test our approach in a conditional generation task. In this setting, we do not evaluate the GAN, as this family of models, in its original formulation, does not have an inference mechanism. We therefore consider four models: our approach, using either the Cartesian or the Polar representation, and the two baselines above. Since we are not sampling, but rather reconstructing an input, we consider both VAE and AE variants of every model, and report the best performing one.
Formally, given a lidar cloud, we evaluate a model’s ability to reconstruct it from a compressed encoding. More relevant to real-world applications, we look at how robust the model’s latent representation is to input perturbation. Specifically, we look at the two following corruption mechanisms:
- Additive Noise: we add Gaussian noise drawn from $\mathcal{N}(0, \sigma^2)$ to the coordinates of the lidar cloud. For this process, we normalize each of the three dimensions independently prior to noise addition. We experiment with varying levels of $\sigma$.
- Data Removal: we remove random points from the input lidar scan. Specifically, the probability of removing a point is modeled as a Bernoulli distribution parametrized by $p$. We consider different values for $p$.
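The two corruption mechanisms can be sketched as follows. This is a simplified illustration with hypothetical function names: the per-dimension normalization described above is omitted, and removed points are dropped from a flat list, whereas a grid-based pipeline would need to mark the corresponding cells as missing instead.

```python
import random

def add_gaussian_noise(points, sigma, rng):
    """Add independent N(0, sigma^2) noise to every coordinate."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]

def drop_points(points, p_remove, rng):
    """Remove each point independently with probability p_remove."""
    return [p for p in points if rng.random() >= p_remove]

rng = random.Random(0)
scan = [(float(i), 0.0, 1.0) for i in range(1000)]   # toy stand-in for a scan
noisy = add_gaussian_noise(scan, 0.1, rng)
sparse = drop_points(scan, 0.3, rng)
```

Sweeping `sigma` and `p_remove` over a range of values gives the corruption spectrum along which the models are compared.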
V-D Unconditional Generation
For this section, we consider the GAN model introduced in section IV-C. Our goal is to train a model that can produce realistic samples. Having access to such a generator can lead to better simulators, which are heavily used to train self-driving agents [35]. In this use case, training an agent in a simulated environment that lacks crispness will likely result in poor skill transfer to real-world navigation. Since GANs have been shown to produce more realistic samples than MLE-based models on images [36], we hope to see similar results with our model in the case of lidar data.
Evaluation criteria: Rigorous quantitative evaluation of samples produced by GANs and generative models is an open research question. GANs trained on images have been evaluated by the Inception Score [27] and the Frechet Inception Distance (FID) [37]. Since there exists no standardized metric for unconditional generation of lidar clouds, we rely on visual inspection of samples for quality assessment.
V-D1 Evaluation criteria
To measure how close the reconstructed output is to the original point cloud, we use the Earth-Mover’s Distance (EMD) [19]. It is defined as

$$
d_{EMD}(S_1, S_2) = \min_{\pi : S_1 \to S_2} \sum_{x \in S_1} \|x - \pi(x)\|_2 \tag{5}
$$

where $\pi$ is a bijection between the two sets.
The EMD gives the solution to the optimal transportation problem, which attempts to transform one point cloud into the other. Recent work [16] has shown that this metric correlates well with human evaluation, and does so better than the Chamfer Distance. Moreover, the Earth Mover’s Distance is sensitive to both global and local structure, and does not require points to be ordered. Additionally, training and evaluating models on the same metric can result in models overfitting to this criterion, at the expense of sample quality [38]. Nevertheless, we also provide results measured by the Chamfer Distance for completeness.
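To make the definition concrete, the sketch below computes the exact EMD between two tiny, equal-size point sets by enumerating every bijection. This brute-force approach is for illustration only, since its cost grows factorially; real evaluations rely on assignment solvers or approximations. The function name is hypothetical.

```python
import itertools
import math

def emd_brute_force(s1, s2):
    """Exact EMD for two equal-size sets by trying every bijection."""
    if len(s1) != len(s2):
        raise ValueError("both sets must have the same number of points")
    best = math.inf
    for perm in itertools.permutations(range(len(s2))):
        # perm[i] is the index in s2 that point i of s1 is matched to.
        cost = sum(math.dist(p, s2[j]) for p, j in zip(s1, perm))
        best = min(best, cost)
    return best

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(1.0, 0.0, 0.0), (0.0, 0.0, 3.0)]
emd = emd_brute_force(a, b)
```

In contrast to the Chamfer Distance, every point must be matched to a distinct partner, so the metric penalizes outputs that cover the target shape with the wrong density.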
V-D2 Training Protocol
For every model considered, we perform the same hyperparameter search. We randomly select the learning rate, the latent dimension and the batch size from a predetermined set of values. This set of values is the same for all models to ensure fairness. This process is repeated for 10 different configurations, from which we choose the one obtaining the best performance on the validation set. We then proceed to evaluate this configuration on the test set according to the metrics described above. All models are trained end-to-end on the same dataset.
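The protocol amounts to a small random search. The sketch below shows its structure; the candidate values in `SEARCH_SPACE` are hypothetical placeholders (the paper's actual values are not listed in the text), and `dummy_objective` stands in for a full training run followed by validation.

```python
import random

SEARCH_SPACE = {                      # hypothetical values, not the paper's
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "latent_dim": [64, 128, 256],
    "batch_size": [32, 64, 128],
}

def random_search(train_and_validate, n_trials=10, seed=0):
    """Sample n_trials configurations and keep the best validation score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values)
                  for name, values in SEARCH_SPACE.items()}
        score = train_and_validate(config)   # lower is better, e.g. EMD
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

def dummy_objective(config):
    """Stand-in for training a model and returning its validation error."""
    return (abs(config["learning_rate"] - 3e-4)
            + abs(config["latent_dim"] - 128) / 1000.0)

best_config, best_score = random_search(dummy_objective)
```

Using the same search space and the same number of trials for every model keeps the comparison fair, since no method receives a larger tuning budget than another.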
VI Results
In this section, we will first discuss results for conditional generation and subsequently evaluate results for unconditional generation of lidar images.
VI-A Conditional
In all conditional tasks, our proposed approach beats the available baselines by a significant margin in terms of EMD, Chamfer Distance, and visual inspection.
VI-A1 Reconstructing clean data
While the baseline models are able to reconstruct the global structure of the lidar scan, they are unable to recover the more fine-grained detail of the input (see Fig. 1). This suggests that leveraging the known structure of the lidar plays a key role in obtaining high quality reconstructions. Quantitative results are shown in Table I.
VI-A2 Reconstructing corrupted data
Next, we evaluate the proposed models on their ability to extract important information from corrupted lidar scans. As shown in Fig. 4, the proposed VAE correctly reconstructs the defining components of the original cloud, even if the given input is seemingly uninformative. We emphasize that our model was not trained with such corrupted data, therefore these results are quite surprising. Animations and additional reconstructions are available online.
Moreover, we observe that as soon as the input is moderately noisy, the proposed Cartesian representation yields better performance. As seen in Fig. 3, this representation performs better than its Polar alternative over the majority of the graph. We observe a similar trend when points are randomly removed from the input, also shown in Fig. 3: when more than 15% of the points are missing, using Cartesian coordinates performs favorably according to EMD. This result suggests that in this corruption regime, having access to absolute positional information provides a better signal to the model. Interesting future work would be to leverage the best of the two representations.
We note that the suboptimal performance of the baselines is mainly due to two factors. First, since points are encoded independently, only information about the global structure is kept, and local fine-grained details are neglected. Second, the Chamfer Distance used for training assumes that the point cloud has a uniform density, which is not the case for lidar scans.
VI-B Unconditional
We perform a visual inspection of generated samples, located in the leftmost column of Figure 5 (more samples are available online). We see that our model generates realistic samples. First, the scans have a well-defined global structure: an aerial view of the samples shows points correctly aligned to model the structure of the road. Second, the samples share local characteristics of real data: the model correctly generates road obstacles, such as cars or cyclists. This amounts to having locations with a dense aggregation of points, followed by a trailing area with almost no points, similar to the shadow of an object. Third, the model respects the point density of the data, where the density is roughly inversely proportional to the distance from the origin. Lastly, our models show good sample diversity.
VI-B1 What is the GAN generating?
In order to better interpret samples from the unconditional generator, we try to match them to real data examples. We perform the following procedure: we encode every sample to a latent representation, given by the output of the third layer of our discriminator. We similarly encode random datapoints from the test set, and match the generated sample to the real datapoint yielding the smallest latent L2 loss. We show three examples of this matching in Figure 5. In the first row, we see the model generating a two layer roadside to the right, consisting of a long shrub, followed by a line of trees. On the second row, we find a large tilted object to the right, which matches a bus turning right. Finally, on the last row we see a sharp enclosing, corresponding to a driveway leading to a garage door.
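The matching step reduces to a nearest-neighbor search in latent space. In the sketch below the latent codes are given directly as short lists of numbers; in the procedure described above they would be the flattened activations of the discriminator's third layer. The function name is hypothetical.

```python
def nearest_real_index(sample_code, real_codes):
    """Index of the real code with the smallest squared L2 distance."""
    def sq_l2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(range(len(real_codes)),
               key=lambda i: sq_l2(sample_code, real_codes[i]))

real_codes = [[0.0, 0.0], [1.0, 1.0], [5.0, -2.0]]   # toy latent codes
match = nearest_real_index([0.9, 1.2], real_codes)
```

Comparing in a learned feature space rather than in raw coordinates makes the match reflect scene layout instead of point-by-point alignment.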
VII Discussion and Future Work
In this work we introduced two generative models for raw lidar data, a GAN and a VAE. We have shown that the proposed adversarial network can generate highly realistic data, and captures both local and global features of real lidar scans. The LiDAR-VAE successfully encodes and reconstructs lidar samples, and is highly robust to missing or imputed data. We demonstrate that when adding enough noise to render the scan uninformative to the human eye, the proposed VAE still extracts relevant information and generates the missing data. Our work in deep generative modeling of lidar enables concrete advancements in real-life applications; the former model can help reduce the discrepancy between synthetic and real lidars in driving simulators, while the latter can be leveraged in deployed vehicles for reconstruction, compression, or prediction of the data stream.
Moreover, we proposed a simple way to encode absolute positional information in the lidar representation, and showed that this leads to better reconstructions when the input is noisy or incomplete. Interesting future work would be to see if this can also lead to improvements in standard lidar processing tasks.
References
- [1] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” Proceedings of the 2nd International Conference on Learning Representations, 2013.
- [2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
- [3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
- [4] Y. Zhou and O. Tuzel, “VoxelNet: End-to-end learning for point cloud based 3D object detection,” arXiv preprint arXiv:1711.06396, 2017.
- [5] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, “Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1355–1361.
- [6] B. Li, T. Zhang, and T. Xia, “Vehicle detection from 3D lidar using fully convolutional network,” arXiv preprint arXiv:1608.07916, 2016.
- [7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
- [8] A. Dewan, G. L. Oliveira, and W. Burgard, “Deep semantic classification for 3D lidar data,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 3544–3549.
