Nesti-Net: Normal Estimation for Unstructured 3D Point Clouds using Convolutional Neural Networks

Yizhak Ben-Shabat; Michael Lindenbaum; Anath Fischer

arXiv:1812.00709·cs.CV·December 4, 2018


TL;DR

Nesti-Net introduces a CNN-based method for estimating normals in unstructured 3D point clouds, utilizing a multi-scale local representation and a mixture-of-experts architecture to improve robustness and accuracy.

Contribution

The paper presents a novel local point cloud representation (MuPS) and a mixture-of-experts CNN architecture for normal estimation, enhancing robustness to noise and density variations.

Findings

01 Achieves state-of-the-art results on synthetic datasets.

02 Improves robustness to noise and point density variations.

03 Provides qualitative results on real scanned scenes.

Abstract

In this paper, we propose a normal estimation method for unstructured 3D point clouds. This method, called Nesti-Net, builds on a new local point cloud representation which consists of multi-scale point statistics (MuPS), estimated on a local coarse Gaussian grid. This representation is a suitable input to a CNN architecture. The normals are estimated using a mixture-of-experts (MoE) architecture, which relies on a data-driven approach for selecting the optimal scale around each point and encourages sub-network specialization. Interesting insights into the network's resource distribution are provided. The scale prediction significantly improves robustness to different noise levels, point density variations and different levels of detail. We achieve state-of-the-art results on a benchmark synthetic dataset and present qualitative results on real scanned scenes.

Tables (6)

Table 1: Comparison of the RMS angle error for unoriented normal vector estimation of our Nesti-Net method to classic geometric methods (PCA [12], Jet [7]) with three scales, and deep learning methods (PCPNet [11], HoughCNN [6]).

Aug.	Our Nesti-Net	PCA [12]			Jet [7]			PCPNet [11]		HoughCNN [6]
scale		small	med	large	small	med	large	ss	ms	ss	ms
None	6.99	8.31	12.29	16.77	7.60	12.35	17.35	9.68	9.62	10.23	10.02
Noise
$σ = 0.00125$	10.11	12.00	12.87	16.87	12.36	12.84	17.42	11.46	11.37	11.62	11.51
$σ = 0.006$	17.63	40.36	18.38	18.94	41.39	18.33	18.85	18.26	18.87	22.66	23.36
$σ = 0.012$	22.28	52.63	27.5	23.5	53.21	27.68	23.41	22.8	23.28	33.39	36.7
Density
Gradient	9.00	9.14	12.81	17.26	8.49	13.13	17.8	13.42	11.7	11.02	10.67
Stripes	8.47	9.42	13.66	19.87	8.61	13.39	19.29	11.74	11.16	12.47	11.95
average	12.41	21.97	16.25	18.87	21.95	16.29	19.02	14.56	14.34	16.9	17.37

Table 2: Comparison of the RMS angle error for unoriented normal vector estimation of our method using single-scale (SS), multi-scale (MS), multi-scale with switching (MS-Sw) and multi-scale with mixture of experts (Nesti-Net).

Aug.	ss		ms	ms-sw	NestiNet
scale	0.01	0.05	0.01 0.05	0.01 0.05	0.01 0.05	0.01 0.03 0.05
None	9.32	12.73	10.83	7.88	7.76	6.99
Noise
$σ = 0.00125$	11.31	13.36	12.98	10.46	10.29	10.11
$σ = 0.006$	36.5	18.37	21.06	18.43	18.45	17.63
$σ = 0.012$	55.24	23.14	26.03	22.59	22.25	22.28
Density
Gradient	16.61	14.65	12.81	11.89	9.44	9.00
Stripes	14.5	14.57	12.97	10.06	9.65	8.47
average	23.91	16.14	16.11	13.55	12.97	12.41

Table 3: Ablation architecture details for single-scale (ss) and multi-scale (ms).

ss	ms
3D Inception(3,5,128)	3D Inception(3,5,128)
3D Inception(3,5,256)	3D Inception(3,5,256)
3D Inception(3,5,256)	3D Inception(3,5,256)
maxpool(2,2,2)	maxpool(2,2,2)
3D Inception(3,5,512)	3D Inception(3,4,512)
3D Inception(3,5,512)	3D Inception(3,4,512)
maxpool(2,2,2)	maxpool(2,2,2)
FC(1024)	FC(1024)
FC(256)	FC(256)
FC(128)	FC(128)
FC(3)	FC(3)

Table 4: Ablation architecture details for multi-scale with switching. First the noise is estimated and then the input is fed into the corresponding scale network according to a threshold. The network for both scales is constructed identically.

ms-sw
noise estimation net	normal estimation net
3D Inception(3,5,128)	3D Inception(3,5,128)
3D Inception(3,5,256)	3D Inception(3,5,256)
3D Inception(3,5,256)	3D Inception(3,5,256)
maxpool(2,2,2)	maxpool(2,2,2)
3D Inception(3,5,512)	3D Inception(3,4,512)
3D Inception(3,5,512)	3D Inception(3,4,512)
maxpool(2,2,2)	maxpool(2,2,2)
FC(1024)	FC(1024)
FC(256)	FC(256)
FC(128)	FC(128)
FC(1)	FC(3)

Table 5: Normal estimation results comparison using the PGP10 metric (higher is better).

Aug.	PCPNet [11]		Jet [7]			PCA [12]			NestiNet
Scale	ss	ms	small	med	large	small	med	large	MoE
None	0.8364	0.8404	0.8802	0.7509	0.6584	0.8686	0.7409	0.6606	0.9120
Noise
$σ = 0.00125$	0.8013	0.8031	0.7346	0.7447	0.6575	0.7712	0.7378	0.6598	0.8384
$σ = 0.006$	0.6667	0.6294	0.1006	0.6397	0.6311	0.1101	0.6402	0.6301	0.7164
$σ = 0.01$	0.5546	0.5124	0.0377	0.3827	0.547	0.04063	0.394	0.5462	0.6123
Density
Gradient	0.7801	0.8062	0.8848	0.7695	0.6401	0.8731	0.7624	0.6366	0.9003
Striped	0.7967	0.8076	0.8743	0.7504	0.6001	0.8609	0.7379	0.5879	0.8929
Average	0.7393	0.7332	0.5854	0.6730	0.6224	0.5874	0.6689	0.6202	0.8120

Table 6: Normal estimation results comparison using the PGP5 metric (higher is better).

Aug.	PCPNet [11]		Jet [7]			PCA [12]			NestiNet
Scale	ss	ms	small	med	large	small	med	large	MoE
None	0.7078	0.6986	0.7905	0.6284	0.5395	0.7756	0.6192	0.5361	0.8057
Noise
$σ = 0.00125$	0.6245	0.5932	0.4132	0.6237	0.5377	0.4758	0.6157	0.5335	0.6611
$σ = 0.006$	0.4486	0.366	0.027	0.4152	0.4837	0.02998	0.42	0.4812	0.5618
$σ = 0.01$	0.3156	0.2482	0.0099	0.1462	0.3715	0.0104	0.154	0.3719	0.399
Density
Gradient	0.6065	0.6254	0.7883	0.6442	0.4976	0.7743	0.647	0.4894	0.7749
Striped	0.6126	0.6231	0.7753	0.6321	0.4598	0.7575	0.6174	0.4415	0.7676
Average	0.5526	0.5257	0.4674	0.5150	0.4816	0.4706	0.5122	0.4756	0.6617

Equations (12)

u_{k}(\bm{p}) = \frac{1}{(2\pi)^{D/2}|\Sigma_{k}|^{1/2}} \exp\left\{-\frac{1}{2}(\bm{p}-\mu_{k})^{\prime}\Sigma_{k}^{-1}(\bm{p}-\mu_{k})\right\}.

u_{\lambda}(\bm{p}) = \sum_{k=1}^{K} w_{k} u_{k}(\bm{p}).

\mathscr{G}_{FV_{\lambda}}^{X} = \sum_{t=1}^{T} L_{\lambda}\nabla_{\lambda}\log u_{\lambda}(\bm{p}_{t}),

\mathscr{G}_{3DmFV_{\lambda}}^{X}=\left[\begin{array}{c}\left.\sum_{t=1}^{T}L_{\lambda}\nabla_{\lambda}\log u_{\lambda}(\bm{p}_{t})\right|_{\lambda=\alpha,\mu,\sigma}\\ \left.\max_{t}\left(L_{\lambda}\nabla_{\lambda}\log u_{\lambda}(\bm{p}_{t})\right)\right|_{\lambda=\alpha,\mu,\sigma}\\ \left.\min_{t}\left(L_{\lambda}\nabla_{\lambda}\log u_{\lambda}(\bm{p}_{t})\right)\right|_{\lambda=\mu,\sigma}\end{array}\right]

\mathscr{G}_{\alpha_{k}}^{X} = \frac{1}{w_{k}} \sum_{t=1}^{T} (\gamma_{t}(k) - w_{k}),

\mathscr{G}_{\mu_{k}}^{X} = \frac{1}{w_{k}} \sum_{t=1}^{T} \gamma_{t}(k) \left(\frac{\bm{p}_{t}-\mu_{k}}{\sigma_{k}}\right),

\mathscr{G}_{\sigma_{k}}^{X} = \frac{1}{2w_{k}} \sum_{t=1}^{T} \gamma_{t}(k) \left[\frac{(\bm{p}_{t}-\mu_{k})^{2}}{\sigma_{k}^{2}} - 1\right].

w_{k} = \frac{\exp(\alpha_{k})}{\sum_{j=1}^{K}\exp(\alpha_{j})}.

\gamma_{t}(k) = \frac{w_{k}u_{k}(\bm{p}_{t})}{\sum_{j=1}^{K} w_{j}u_{j}(\bm{p}_{t})}.

\mathscr{G}_{FV_{\lambda}}^{X} \leftarrow \frac{1}{T}\mathscr{G}_{FV_{\lambda}}^{X}, \quad \mathscr{G}_{3DmFV_{\lambda}}^{X} \leftarrow \frac{1}{T}\mathscr{G}_{3DmFV_{\lambda}}^{X}.

\mathscr{G}_{MuPS}^{\bm{p}} = \left(\mathscr{G}_{3DmFV}^{\tilde{X}_{1}(r_{1})}, ..., \mathscr{G}_{3DmFV}^{\tilde{X}_{n}(r_{n})}\right).

L = \sum_{i=1}^{n} q_{i}\cdot D_{N} = \sum_{i=1}^{n} q_{i}\frac{\|N_{i}\times N_{GT}\|}{\|N_{i}\|\cdot\|N_{GT}\|}.


Code & Models

Repositories

sitzikbs/Nesti-Net (TensorFlow, official)


Full text

Nesti-Net: Normal Estimation for Unstructured 3D Point Clouds using Convolutional Neural Networks

Yizhak Ben-Shabat

Mechanical Engineering

Technion IIT

Haifa, Israel

Michael Lindenbaum

Computer Science

Technion IIT

Haifa, Israel

Anath Fischer

Mechanical Engineering

Technion IIT

Haifa, Israel

Abstract

In this paper, we propose a normal estimation method for unstructured 3D point clouds. This method, called Nesti-Net, builds on a new local point cloud representation which consists of multi-scale point statistics (MuPS), estimated on a local coarse Gaussian grid. This representation is a suitable input to a CNN architecture. The normals are estimated using a mixture-of-experts (MoE) architecture, which relies on a data-driven approach for selecting the optimal scale around each point and encourages sub-network specialization. Interesting insights into the network’s resource distribution are provided. The scale prediction significantly improves robustness to different noise levels, point density variations and different levels of detail. We achieve state-of-the-art results on a benchmark synthetic dataset and present qualitative results on real scanned scenes.

1 Introduction

Commodity 3D sensors are rapidly becoming an integral part of autonomous systems. These sensors, e.g. RGB-D cameras or LiDAR, provide a 3D point cloud representing the geometry of the scanned objects and surroundings. This raw representation is challenging to process since it lacks connectivity information or structure, and is often incomplete, noisy and contains point density variations. In particular, processing it by means of the highly effective convolutional neural networks (CNNs) is problematic because CNNs require structured, grid-like data as input.

When available, additional local geometric information, such as the surface normals at each point, induces a partial local structure and improves performance of different tasks such as over-segmentation [3], classification [21] and surface reconstruction [11].

Estimating the normals from a raw, point-only cloud is a challenging task due to difficulties associated with sampling density, noise, outliers, and detail level. The common approach is to specify a local neighborhood around a point and to fit a local basic geometric surface (e.g. a plane) to the points in this neighborhood. Then the normal is estimated from the fitted geometric entity. The chosen size (or scale) of the neighborhood introduces an unavoidable tradeoff between robustness to noise and accuracy of fine details. A large-scale neighborhood over-smooths sharp corners and small details but is otherwise robust to noise. A small neighborhood, on the other hand, may reproduce the normals more accurately around small details but is more sensitive to noise. Thus it seems that an adaptive, data-driven scale may improve estimation performance.

We propose a normal estimation method for unstructured 3D point clouds. It features a mixture-of-experts network for scale prediction, which significantly increases its robustness to different noise levels, outliers, and varying levels of detail. In addition, this method overcomes the challenge of feeding point clouds into CNNs by extending the recently proposed 3D modified Fisher Vector (3DmFV) representation [4] to encode local geometry on a coarse multi-scale grid. It outperforms state-of-the-art methods for normal vector estimation.

The main contributions of this paper are:

• A new normal estimation method for unstructured 3D point clouds based on mixture of experts and scale prediction.

• A local point representation which can be used as input to a CNN.

2 Related-work

2.1 Deep learning for unstructured 3D point clouds

The point cloud representation is challenging for deep learning methods because it is both unstructured and point-wise unordered. In addition, the number of points in the point cloud is usually not constant. Several methods were proposed to overcome these challenges. Voxel-based methods embed the point cloud into a voxel grid but suffer from several accuracy-complexity tradeoffs [16]. The PointNet approach [20, 21] applies a symmetric, order-insensitive, function on a high-dimensional representation of individual points. The Kd-Network [14] imposes a kd-tree structure on the points and uses it to learn shared weights for nodes in the tree. The recently proposed 3DmFV [4] represents the points by their deviation from a Gaussian Mixture Model (GMM) whose Gaussians are uniformly positioned on a coarse grid.

In this paper, we propose a point-wise and multi-scale variation of 3DmFV. Instead of generating a structured representation for the entire point cloud, we represent each point and its neighbors within several scales.

2.2 Normal estimation

A classic method for normal estimation uses Principal Component Analysis (PCA) [12]. It first specifies the neighbors within some scale, and then uses PCA regression to estimate a tangent plane. Variants fitting local spherical surfaces [10] or jets [7] (truncated Taylor expansion) were also proposed. To be robust to noise, these methods usually choose a large-scale neighborhood, leading them to smooth sharp features and to fail in estimating normals near edges. Computing the optimal neighborhood size can decrease the estimation error [18] but requires the (usually unknown) noise standard deviation value and a costly iterative process to estimate the local curvature and additional density parameters.
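The PCA baseline described above can be sketched in a few lines. This is a minimal illustration, not the implementation evaluated in [12]: the function name `pca_normal` is hypothetical, and a shifted power iteration stands in for a library eigen-solver so that the sketch needs only the standard library.

```python
import math

def pca_normal(points, iters=200):
    """Estimate an unoriented normal for a small patch of 3D points."""
    n = len(points)
    # Centroid of the neighborhood.
    c = [sum(p[a] for p in points) / n for a in range(3)]
    # 3x3 covariance matrix of the centered points.
    cov = [[sum((p[a] - c[a]) * (p[b] - c[b]) for p in points) / n
            for b in range(3)] for a in range(3)]
    # The normal is the eigenvector of the smallest eigenvalue of cov.
    # Power iteration on (trace * I - cov) finds it, because the shift
    # turns the smallest eigenvalue of cov into the largest one.
    trace = cov[0][0] + cov[1][1] + cov[2][2]
    shifted = [[(trace if a == b else 0.0) - cov[a][b] for b in range(3)]
               for a in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(shifted[a][b] * v[b] for b in range(3)) for a in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Noise-free patch sampled from the plane z = 0.
patch = [(x * 0.1, y * 0.1, 0.0) for x in range(-3, 4) for y in range(-3, 4)]
normal = pca_normal(patch)
```

For this planar patch the estimate aligns with the z axis. The neighborhood size passed in is exactly the scale parameter whose choice drives the noise versus detail tradeoff discussed in this section.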

Other approaches rely on using Voronoi cells of point clouds [2, 17, 9]. These methods are characterized by robustness to sharp features but are sensitive to noise. To overcome this challenge, Alliez et al. [1] proposed a PCA-Voronoi approach to create cell sets which group adjacent cells to provide some control over smoothness. While many of these methods hold theoretical guarantees on approximation and robustness, in practice they rely on a preprocessing stage in the presence of strong or unstructured noise in addition to a fine-tuned set of parameters.

A few deep learning approaches have been proposed to estimate normal vectors from unstructured point clouds. Boulch et al. proposed to transform local point cloud patches into a 2D Hough space accumulator by randomly selecting point triplets and voting for that plane’s normal. Then, the normal is estimated from the accumulator by designing explicit criteria [5] for bin selection or, more recently, by training a 2D CNN [6] to estimate it continuously as a regression problem. This method does not fully utilize the 3D information since it loses information during the transformation stage. We refer to this method as HoughCNN in the evaluation section. A more recent method, PCPNet [11], uses a PointNet [20] architecture on local point neighborhoods of multiple scales. It achieves good normal estimation performance and has been extended to estimating other surface properties. However, it processes the multi-scale point clouds jointly and does not select an optimal scale. This type of architecture tends to encourage averaging during training rather than specialization [13].

In this paper we propose a method that approximates the local normal vector using a point-wise, multi-scale 3DmFV representation which serves as an input to a deep 3D CNN architecture. In addition, we learn the neighborhood size that minimizes the normal estimation error using a mixture of experts (MoE) [13], which encourages specialization.

2.3 Representing point clouds using 3DmFV

The 3DmFV representation for point clouds [4] achieved good results for point cloud classification using a 3D CNN. See Section 3.1 for details. In this paper we propose the Multi-scale Point Statistics (MuPS) representation, which extends the 3DmFV and computes a point-wise multi-scale 3DmFV.

3 Approach

The proposed method is outlined in Figure 1. It receives a 3D point cloud as input and consists of two main stages. In the first stage, we compute a multiscale point representation, denoted MuPS. In the second stage we feed it into a mixture-of-experts (MoE) CNN architecture and estimate the normal at each point as output. The stages are detailed below.

3.1 MuPS - Multi-scale point statistics

MuPS is a local multi-scale representation which computes point statistics on a coarse Gaussian grid. It builds on the well-known Fisher Vector (FV) [22], and the recently proposed 3DmFV representation [4]. Therefore, we first outline the FV and the 3DmFV, and then continue to the MuPS representation and its attractive properties.

FV and 3DmFV for 3D point clouds: Consider a set of $T$ 3D points $X=\{\bm{p}_{t}\in\mathbb{R}^{3},t=1,...T\}$ and a set of parameters for a $K$-component GMM, $\lambda=\{(w_{k},\mu_{k},\Sigma_{k}),k=1,...K\}$, where $w_{k},\mu_{k},\Sigma_{k}$ are the mixture weight, center, and covariance matrix of the $k$-th Gaussian. The likelihood of a single 3D point $\bm{p}$ associated with the $k$-th Gaussian density is

$$u_{k}(\bm{p}) = \frac{1}{(2\pi)^{D/2}|\Sigma_{k}|^{1/2}} \exp\left\{-\frac{1}{2}(\bm{p}-\mu_{k})^{\prime}\Sigma_{k}^{-1}(\bm{p}-\mu_{k})\right\}. \tag{1}$$

Therefore, the likelihood of a single point associated with the GMM density is:

$$u_{\lambda}(\bm{p}) = \sum_{k=1}^{K} w_{k} u_{k}(\bm{p}). \tag{2}$$

The 3DmFV uses a uniform GMM on a coarse ${m\times m\times m}$ 3D grid, where $m$ is an adjustable parameter usually chosen to be from $m=3$ to $8$ . The weights are set to be $w_{k}=\frac{1}{K}$ , the standard deviation is set to be $\sigma_{k}=\frac{1}{m}$ , and the covariance matrix to be $\Sigma_{k}=\sigma_{k}I$ . Although the parameters in GMMs are usually set using maximum likelihood estimation, here uniformity is crucial for shared weight filtering (convolutions).
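The uniform Gaussian grid and the resulting mixture likelihood can be sketched as follows. The helper names are hypothetical, and two details are assumptions that the text does not fix: the centers are placed at cell centers of a grid over $[-1,1]^3$, and $\sigma_k$ is treated as a per-axis standard deviation.

```python
import math
from itertools import product

def gaussian_grid(m):
    """Centers, weights and sigma of a uniform m x m x m GMM."""
    # Assumption: centers sit at the cell centers of a grid over [-1, 1]^3.
    step = 2.0 / m
    axis = [-1.0 + step * (i + 0.5) for i in range(m)]
    centers = list(product(axis, repeat=3))
    k = len(centers)                 # K = m^3 Gaussians
    weights = [1.0 / k] * k          # w_k = 1/K
    sigma = 1.0 / m                  # sigma_k = 1/m
    return centers, weights, sigma

def gaussian_density(p, mu, sigma):
    """Isotropic 3D Gaussian density u_k(p) with D = 3."""
    # Assumption: sigma is the per-axis standard deviation.
    sq = sum((pi - mi) ** 2 for pi, mi in zip(p, mu))
    norm = (2.0 * math.pi) ** 1.5 * sigma ** 3
    return math.exp(-0.5 * sq / sigma ** 2) / norm

def gmm_likelihood(p, centers, weights, sigma):
    """Mixture likelihood u_lambda(p) = sum_k w_k u_k(p)."""
    return sum(w * gaussian_density(p, mu, sigma)
               for w, mu in zip(weights, centers))

centers, weights, sigma = gaussian_grid(3)
likelihood = gmm_likelihood((0.0, 0.0, 0.0), centers, weights, sigma)
```

Because every Gaussian shares the same weight and width, neighboring grid cells are described by the same kind of statistics, which is what allows shared convolution filters to slide over the grid.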

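The FV is expressed as the sum of normalized gradients for each point $\bm{p}_{t}$. The 3DmFV is specified similarly using additional symmetric functions, i.e. $\min$ and $\max$. They are symmetric in the sense proposed in [20] and are therefore adequate for representing the structureless and orderless set of points. Adding these functions makes the representation more informative and the associated classification more accurate [4]:

$$\mathscr{G}_{FV_{\lambda}}^{X} = \sum_{t=1}^{T} L_{\lambda}\nabla_{\lambda}\log u_{\lambda}(\bm{p}_{t}), \tag{3}$$

$$\mathscr{G}_{3DmFV_{\lambda}}^{X}=\left[\begin{array}{c}\left.\sum_{t=1}^{T}L_{\lambda}\nabla_{\lambda}\log u_{\lambda}(\bm{p}_{t})\right|_{\lambda=\alpha,\mu,\sigma}\\ \left.\max_{t}\left(L_{\lambda}\nabla_{\lambda}\log u_{\lambda}(\bm{p}_{t})\right)\right|_{\lambda=\alpha,\mu,\sigma}\\ \left.\min_{t}\left(L_{\lambda}\nabla_{\lambda}\log u_{\lambda}(\bm{p}_{t})\right)\right|_{\lambda=\mu,\sigma}\end{array}\right] \tag{4}$$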

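where $L_{\lambda}$ is the square root of the inverse Fisher Information Matrix, and the normalized gradients are:

$$\mathscr{G}_{\alpha_{k}}^{X} = \frac{1}{w_{k}} \sum_{t=1}^{T} (\gamma_{t}(k) - w_{k}), \tag{5}$$

$$\mathscr{G}_{\mu_{k}}^{X} = \frac{1}{w_{k}} \sum_{t=1}^{T} \gamma_{t}(k) \left(\frac{\bm{p}_{t}-\mu_{k}}{\sigma_{k}}\right), \tag{6}$$

$$\mathscr{G}_{\sigma_{k}}^{X} = \frac{1}{2w_{k}} \sum_{t=1}^{T} \gamma_{t}(k) \left[\frac{(\bm{p}_{t}-\mu_{k})^{2}}{\sigma_{k}^{2}} - 1\right]. \tag{7}$$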

Here, we follow [15] and ensure that $u_{\lambda}(x)$ is a valid distribution by changing the variable $w_{k}$ to $\alpha_{k}$, which also simplifies the gradient calculation, using:

$$w_{k} = \frac{\exp(\alpha_{k})}{\sum_{j=1}^{K}\exp(\alpha_{j})}. \tag{8}$$

In addition, $\gamma_{t}(k)$ expresses the soft assignment of point $\bm{p}_{t}$ to Gaussian $k$, as obtained from the derivatives:

$$\gamma_{t}(k) = \frac{w_{k}u_{k}(\bm{p}_{t})}{\sum_{j=1}^{K} w_{j}u_{j}(\bm{p}_{t})}. \tag{9}$$

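The FV and 3DmFV are normalized by the number of points in order to be sample-size independent [22]:

$$\mathscr{G}_{FV_{\lambda}}^{X} \leftarrow \frac{1}{T}\mathscr{G}_{FV_{\lambda}}^{X}, \quad \mathscr{G}_{3DmFV_{\lambda}}^{X} \leftarrow \frac{1}{T}\mathscr{G}_{3DmFV_{\lambda}}^{X}. \tag{10}$$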

Note that these are global representations which are applied to the entire set, i.e., the entire point cloud.
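To make the construction concrete, the following sketch computes a 3DmFV-style vector for a small point set. It is an illustration under stated assumptions rather than the authors' code: the function name `three_d_mfv` is hypothetical, the grid is placed over $[-1,1]^3$, $\sigma_k$ is treated as a per-axis standard deviation, and the gradient prefactors follow the equations as they are printed in this text.

```python
import math
from itertools import product

def three_d_mfv(points, m=2):
    """3DmFV-style vector for a point set on a uniform m^3 Gaussian grid."""
    # Assumptions: grid centers span [-1, 1]^3; sigma is a per-axis std.
    step = 2.0 / m
    axis = [-1.0 + step * (i + 0.5) for i in range(m)]
    centers = list(product(axis, repeat=3))
    k_total, sigma, t_total = len(centers), 1.0 / m, len(points)
    w = 1.0 / k_total  # uniform mixture weight

    # Per-point gradient rows: for every Gaussian, 1 alpha + 3 mu + 3 sigma.
    rows = []
    for p in points:
        dens = [math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(p, mu))
                         / sigma ** 2) for mu in centers]
        total = sum(dens)  # shared normalizer and w_k cancel in gamma
        row = []
        for mu, d in zip(centers, dens):
            gamma = d / total                      # soft assignment
            z = [(a - b) / sigma for a, b in zip(p, mu)]
            g_alpha = [(gamma - w) / w]            # alpha gradient summand
            g_mu = [gamma * zi / w for zi in z]    # mu gradient summand
            g_sig = [gamma * (zi * zi - 1.0) / (2.0 * w) for zi in z]
            row.append(g_alpha + g_mu + g_sig)
        rows.append(row)

    feature = []
    for k in range(k_total):
        cols = list(zip(*(row[k] for row in rows)))   # 7 columns over points
        sums = [sum(c) for c in cols]
        maxs = [max(c) for c in cols]
        mins = [min(c) for c in cols[1:]]             # min only for mu, sigma
        # Normalize by the number of points, as in the text.
        feature.extend(v / t_total for v in sums + maxs + mins)
    return feature

patch = [(0.1, 0.2, 0.0), (-0.3, 0.1, 0.4), (0.0, 0.0, 0.0)]
vec = three_d_mfv(patch, m=2)
```

Each Gaussian contributes 7 sums, 7 maxima and 6 minima, so the vector has $20K$ entries regardless of how many points are in the set or in which order they arrive.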

MuPS definition: For each point $\bm{p}$ in point set $X$ we first extract $n$ point subsets $\tilde{X}_{i}(r_{i})|_{i=1,...n}\subset X$, each of which contains $T_{i}(\bm{p},r_{i})$ points that lie within a distance $r_{i}$ from $\bm{p}$. We refer to each of these subsets as a scale. Note that each scale may contain a different number of points. For scales with many points, we set a maximal point threshold, and sample a random subset of $T_{max}$ points for that scale. Here, $r_{i}$ and $T_{max}$ are design parameters. Next, the scales (subsets) are independently translated and uniformly scaled so that they fit into a zero-centered unit sphere with $\bm{p}$ mapped to the origin. Then, the 3DmFV representation is computed for each scale relative to a Gaussian grid positioned around the origin; see above. Concatenating the 3DmFVs of all scales yields the MuPS representation:

$$\mathscr{G}_{MuPS}^{\bm{p}} = \left(\mathscr{G}_{3DmFV}^{\tilde{X}_{1}(r_{1})}, ..., \mathscr{G}_{3DmFV}^{\tilde{X}_{n}(r_{n})}\right). \tag{11}$$

MuPS properties: The MuPS representation overcomes the main challenges associated with feeding point clouds into CNNs. The symmetric functions make it independent of the number and order of points within each scale. In addition, the GMM gives it its grid structure, necessary for the use of CNNs. Furthermore, the multi-scale representation incorporates description of fine detail as well as robustness to noise.
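The neighborhood extraction step of MuPS can be sketched as below. The function name `extract_scales`, the brute-force radius search and the example radii are illustrative assumptions. A practical implementation would likely use a spatial index such as a kd-tree.

```python
import math
import random

def extract_scales(points, query, radii, t_max, seed=0):
    """Return one normalized neighborhood (scale) per radius."""
    rng = random.Random(seed)
    scales = []
    for r in radii:
        # Points within distance r of the query point.
        near = [p for p in points if math.dist(p, query) <= r]
        # Cap the neighborhood at t_max points by random subsampling.
        if len(near) > t_max:
            near = rng.sample(near, t_max)
        # Translate the query to the origin and scale into the unit sphere.
        scales.append([tuple((a - b) / r for a, b in zip(p, query))
                       for p in near])
    return scales

# Synthetic cloud used only to exercise the function.
rng = random.Random(1)
cloud = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(2000)]
query = cloud[0]
scales = extract_scales(cloud, query, radii=[0.2, 0.4], t_max=64)
```

Each normalized scale would then be encoded on the Gaussian grid and the per-scale vectors concatenated to form the MuPS representation of the query point.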

3.2 The Nesti-Net architecture

The deep network architecture is outlined in Figure 1 (the green part). It is a mixture-of-experts architecture [13] which consists of two modules: a scale manager network module, and an experts module. The MoE architecture was chosen in order to overcome the averaging effect of typical networks when solving a regression problem.

Scale manager network: This module receives the MuPS representation as input and processes it using several 3D Inception inspired convolutions, and maxpool layers, followed by four fully connected layers, after which a softmax operator is applied. The architecture is specified in the top left part of Figure 2. The output is a vector of $n$ scalars $q_{i}$ , which can be intuitively interpreted as the probability of expert $i$ to estimate the normal correctly.

Experts: The normal is estimated using $n$ separate ”expert” networks. Each is a multi-layered 3D Inception inspired CNN followed by four fully connected layers. The MuPS representation is distributed to the experts. This distribution is a design choice. We obtained the best results when feeding each scale to two different experts in addition to one expert which receives the entire MuPS representation as input. Specifically, Nesti-Net uses 7 experts: experts 1-2 receive the smallest scale (1%), 3-4 the medium scale (3%), 5-6 the large scale (5%), and expert 7 receives all the scales. The last layer of each expert outputs a three-element vector $N_{i}=(N_{x},N_{y},N_{z})_{i}$ . The final predicted normal (for point $\bm{p}$ ) is $N_{argmax(q_{i})}$ , i.e., the normal associated with the expert expected to give the best results. The architecture is specified in the top right of Figure 2.
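The test-time selection rule can be illustrated with a short sketch. The manager scores and expert outputs below are made-up values, and `select_normal` is a hypothetical helper; only the expert-to-scale assignment is taken from the text.

```python
# Input scale fed to each of the 7 experts, as listed in the text.
EXPERT_INPUT = ["1%", "1%", "3%", "3%", "5%", "5%", "all"]

def select_normal(q, expert_normals):
    """Pick the normal of the expert with the highest manager score."""
    best = max(range(len(q)), key=lambda i: q[i])
    return best, expert_normals[best]

# Hypothetical manager output for 7 experts (softmax scores sum to 1).
q = [0.05, 0.10, 0.40, 0.15, 0.10, 0.05, 0.15]
# Hypothetical expert predictions.
normals = [(0.0, 0.0, 1.0)] * 2 + [(0.0, 1.0, 0.0)] * 2 + [(1.0, 0.0, 0.0)] * 3
index, normal = select_normal(q, normals)
chosen_scale = EXPERT_INPUT[index]
```

In this example the third expert wins, so the prediction comes from a medium-scale (3%) expert.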

Loss function: We train the network to minimize the difference between a predicted normal $N_{i}$ and a ground truth normal $N_{GT}$. This difference is quantified by the metric $D_{N}=\sin{\theta}$, where $\theta$ is the angle between the two vectors, and $D_{N}$ is calculated as the magnitude of the cross product between these two vectors, divided by their norms; see Eq. 12. In addition, to encourage specialization of each expert network, we follow [13] and minimize the loss:

$$L = \sum_{i=1}^{n} q_{i}\cdot D_{N} = \sum_{i=1}^{n} q_{i}\frac{\|N_{i}\times N_{GT}\|}{\|N_{i}\|\cdot\|N_{GT}\|}. \tag{12}$$

Using this loss, each expert is rewarded for specializing in a specific input type. Note that during training, all $n$ normal vectors are predicted and used to compute the loss and derivatives. However, at test time, we compute only one normal, which is associated with the maximal $q_{i}$ .
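The loss for a single point can be written out directly from its definition. This is a plain-Python sketch for illustration, with hypothetical helper names; the original work used TensorFlow, and batching and gradients are omitted here.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def sin_distance(n_pred, n_gt):
    """D_N = |N x N_GT| / (|N| |N_GT|) = sin(theta)."""
    return norm(cross(n_pred, n_gt)) / (norm(n_pred) * norm(n_gt))

def moe_loss(q, expert_normals, n_gt):
    """L = sum_i q_i * D_N(N_i, N_GT) for one point."""
    return sum(qi * sin_distance(ni, n_gt)
               for qi, ni in zip(q, expert_normals))

n_gt = (0.0, 0.0, 1.0)
# Three hypothetical experts: parallel, perpendicular, and flipped.
experts = [(0.0, 0.0, 2.0), (0.0, 1.0, 0.0), (0.0, 0.0, -1.0)]
loss = moe_loss([0.5, 0.25, 0.25], experts, n_gt)
```

Only the perpendicular expert contributes, giving a loss of 0.25. The flipped normal incurs no penalty because $\sin\theta$ vanishes at 180 degrees, which matches the unoriented setting of the evaluation.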

4 Evaluation

4.1 Datasets

For training and testing we used the PCPNet shape dataset [11]. The trainset consists of 8 shapes: four CAD objects (fandisk, boxunion, flower, cup) and four high quality scans of figurines (bunny, armadillo, dragon and turtle). All shapes are given as triangle meshes and densely sampled with 100k points. The data is augmented by introducing Gaussian noise for each point’s spatial location with a standard deviation of 0.012, 0.006, 0.00125 w.r.t. the bounding box. This yields a set with 3.2M points and 3.2M corresponding training examples. The test set consists of 22 shapes, including figurines, CAD objects, and analytic shapes. For evaluation we use the same 5000 point subset per shape as in [11].

For qualitative testing on scanned data, we used the NYU Depth V2 dataset [19] and the recent ScanNet dataset [8], which include RGB-D images of indoor scenes.

4.2 Training details

All variations of our method were trained using 32,768 (1024 samples $\times 32$ shapes) random subsets of the 3.2M training samples at each epoch. For each point, we extract 512 neighboring points enclosed within a sphere of radius $r$. For neighborhoods with more than 512 points, we perform random sampling, and for those with fewer points we use the maximum number of points available. For the MuPS representation we chose to use an $m=8$ Gaussian grid. We used TensorFlow on a single NVIDIA Titan Xp GPU.
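The sample counts quoted in Sections 4.1 and 4.2 can be checked with a few lines of arithmetic. The sketch assumes that the 32 training shapes are the 8 base shapes in four variants, i.e. the noise-free originals plus the three noise levels; the text implies this but does not state it explicitly.

```python
shapes = 8
points_per_shape = 100_000
# Assumption: the noise-free originals count as a fourth variant.
noise_levels = [0.0, 0.00125, 0.006, 0.012]

variants = shapes * len(noise_levels)            # shape variants in the set
training_points = variants * points_per_shape    # total training examples
samples_per_epoch = 1024 * variants              # random subset per epoch
```

Under this assumption the numbers agree with the text: 32 shape variants, 3.2M training points, and 32,768 samples per epoch.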

4.3 Normal estimation performance

We use the RMS normal estimation error metric for comparing the proposed NestiNet to other deep learning based [11, 6] and geometric methods [12, 7]. We also analyze robustness for two types of data corruptions (augmentations):

• Gaussian noise - perturbing the points with three levels of noise specified by $\sigma$, given as a percentage of the bounding box.

• Density variation - selecting a subset of the points based on two sampling regimes: gradient, simulating effects of distance from the sensor, and stripes, simulating occlusions.

For the geometric methods, we show results for three different scales: small, medium and large, which correspond to 18, 112, 450 nearest neighbors. For the deep learning based methods we show the results for the single-scale (ss) and multi-scale (ms) versions. Additional evaluation results using other metrics are available in the supplemental material.
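For reference, an RMS angle error over unoriented normals can be computed as in the sketch below. The helper names are hypothetical, and the exact evaluation protocol of [11] (for example, how errors are aggregated across shapes) is not reproduced here.

```python
import math

def unoriented_angle_deg(n_pred, n_gt):
    """Angle in degrees between two normals, ignoring their sign."""
    dot = sum(a * b for a, b in zip(n_pred, n_gt))
    norms = (math.sqrt(sum(a * a for a in n_pred))
             * math.sqrt(sum(b * b for b in n_gt)))
    # abs() makes the error invariant to flipping either normal;
    # min() guards acos against rounding slightly above 1.
    cosine = min(1.0, abs(dot) / norms)
    return math.degrees(math.acos(cosine))

def rms_angle_error(preds, gts):
    """Root mean square of the per-point unoriented angle errors."""
    errs = [unoriented_angle_deg(p, g) for p, g in zip(preds, gts)]
    return math.sqrt(sum(e * e for e in errs) / len(errs))

gts = [(0.0, 0.0, 1.0)] * 3
# Hypothetical predictions: exact, flipped, and perpendicular.
preds = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (1.0, 0.0, 0.0)]
rms = rms_angle_error(preds, gts)
```

The three per-point errors are 0, 0 and 90 degrees, so the RMS is $\sqrt{2700}\approx 51.96$ degrees. A single large error dominates the score, which is why this metric is sensitive to failures near edges.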

Table 1 shows the unoriented normal estimation results for the methods detailed above. It can be seen that our method outperforms all other methods across all noise levels and most density variations. It also shows that both PCA and Jet perform well for specific noise-scale pairs. In addition, for PCPNet and HoughCNN, using a multi-scale approach only mildly improves performance.

Figure 3 illustrates Nesti-Net’s results on three point clouds. For visualization, the normal vector is mapped to the RGB cube. It shows that for complex shapes (pillar, liberty) with high noise levels, the general direction of the normal vector is predicted correctly, but the fine details and exact normal vector are not obtained. For a basic shape (Boxysmooth) the added noise does not affect the results substantially. Most notably, Nesti-Net shows robustness to point density corruptions. The angular error in each point is visualized in Figure 4 for the different methods using a heat map. For PCA and Jet we display the best result out of the three scales (small, medium, and large, specified above), and for PCPNet the best out of its single-scale and multi-scale options. For all methods, it can be seen that more errors occur near edges, corners and small regions with a lot of detail and high curvature. Nesti-Net suffers the least from this effect due to its scale manager, which allows it to adapt to the different local geometry types.

Figure 5 shows the performance of the scale manager network. A color is assigned to each expert and the chosen expert color is visualized over the point cloud. This provides some insight regarding each expert’s specialization. For example, the figure shows that experts 1 and 2 (small scale) specialize in points in regions with high curvature (near corners). Experts 3 and 4 (medium scale) specialize in the complex cases where multiple surfaces are close to each other, or in the presence of noise. As for the large-scale experts, expert 5 specializes in planar surfaces whose normal vectors have a large component in the $x$ direction, whereas expert 6 specializes in planar surfaces whose normal vectors have a large component in the $y$ direction. Expert 5 also specializes in very noisy planar surfaces with a large component in the $z$ direction. Expert 7 (combined scales) plays multiple roles; it handles points on planar surfaces which have a large component in the $z$ direction, complex geometry, and low to medium noise. Figure 6 shows the number of points assigned to each expert for all points in the test set, and the average error per expert. It shows an inverse relation between the number of points assigned to an expert and its average error: the more points assigned to the expert, the lower the error. This is consistent with the definition of the cost function. Timing performance and visualization of additional results are provided in the supplemental material.

4.4 Scale selection performance

We analyze the influence of scale selection on the normal estimation performance. We create several ablations of our method.

• ss - A single scale version which directly feeds a 3DmFV representation into a CNN architecture (a single-scale MuPS).

• ms - A multi-scale version which feeds the MuPS representation into a CNN architecture.

• ms-sw - A multi scale version which first tries to estimate the noise level and then feeds the 3DmFV representation of the corresponding input scale into different sub-networks for a discrete number of noise levels (switching). Note that for this version, the noise level is provided during training.

• NestiNet - The method described in Section 3 which uses an MoE network to learn the scale.

Details of the architectures for the above methods are provided in the supplemental material.
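The switching rule of the ms-sw ablation can be pictured as a simple threshold on the estimated noise level. The sketch below is only an illustration: the function name and the threshold value are hypothetical, since the text says only that a predetermined threshold is used, and the two scales are the ones listed for ms-sw in Table 2.

```python
def switch_scale(noise_estimate, threshold=0.003, scales=(0.01, 0.05)):
    """Route low-noise inputs to the small scale, others to the large one."""
    # threshold is a hypothetical value; the paper does not report it.
    return scales[0] if noise_estimate < threshold else scales[1]

low_noise_scale = switch_scale(0.00125)
high_noise_scale = switch_scale(0.012)
```

Unlike this hand-set rule, the scale manager in Nesti-Net learns its routing from the normal estimation loss alone, without noise labels.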

Table 2 summarizes the results of the scale selection performance analysis. It shows that Nesti-Net’s scale selection performs better than all other variations. This is due to the trained scale-manager network within the MoE. The single-scale version performs well for specific noise-scale pairs but yields inferior performance when the selected scale is inadequate. The multi-scale variations show improvement; however, selecting the correct scale yields improved performance over concatenating multiple scales. The main advantage of Nesti-Net over the switching variation is that the scale prediction is unsupervised, i.e., does not need the additional noise parameters as input during training.

4.5 Results on scanned data

We show qualitative results on scanned point clouds from the ScanNet [8] and NYU Depth V2 [19] datasets in Figure 7. For visualization we project the normal vectors’ color back to the depth image plane. Column (c) shows the results for Nesti-Net, trained on synthetic data with Gaussian noise. The estimated normals reveal the non-smoothness of the scanned data, which stems from the correlated, non-Gaussian noise signature of the scanning process. Essentially, it shows normal estimation of the raw data, rather than the desired normal of the underlying surface. The raw point clouds suffer from "bumps" which get bigger as the distance from the sensor increases. Further improvement may be obtained by training Nesti-Net on data corrupted with scanner noise and with ground truth normals, but such data is currently not available and is difficult to manually label. Instead, we train Nesti-Net with normal vectors obtained from applying a Total Variation (TV) algorithm on the depth map, provided by Ladicky et al. [23] for the NYU Depth V2 dataset. Note that TV substantially smooths fine detail and uses the depth image rather than unstructured point clouds. Column (d) in Figure 7 shows that after training on the TV data, the normal vector estimation of the underlying surface improves significantly. Column (b) shows the results of PCA with a medium scale for reference. For a small radius the result is significantly noisier, and for a large radius it over-smooths detail; see supplemental material. Note that Nesti-Net performs the estimation on the raw point cloud and does not use the depth image grid structure.

5 Summary

In this work, we propose multi-scale point statistics, a new representation for 3D point clouds that encodes fine and coarse details while using a grid structure. The representation is effectively processed by a CNN architecture (Nesti-Net) to provide accurate normal estimation, which can be used for various applications, e.g., surface reconstruction. The mixture-of-experts design of the architecture enables the prediction of an optimal local scale and provides insights into the network’s resource distribution. The proposed representation and architecture achieve state-of-the-art results relative to all other methods and demonstrate robustness to noise and occlusion data corruptions.

6 Acknowledgment

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Nesti-Net: Normal Estimation for Unstructured 3D Point Clouds using Convolutional Neural Networks

Supplementary Material

A Scale selection methods: architecture details

In Section 4.4 we report the performance of several ablations of our method. Here we detail the architecture of the following ablations:

• ss - A single-scale version which directly feeds a 3DmFV representation into a CNN architecture (a single-scale MuPS); see Table 3.

• ms - A multi-scale version which feeds the full MuPS representation into a CNN architecture; see Table 3.

• ms-sw - A multi-scale version which first attempts to estimate the noise level and then feeds the 3DmFV representation of the corresponding input scale into two sub-networks for two noise levels (switching). Note that for this version, the noise level is provided during training and we use a predetermined threshold for the sub-network selection; see Table 4.
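
The switching logic of the ms-sw ablation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the dictionary keys, the threshold value, and the pairing of low noise with the small scale (and high noise with the large scale) are all hypothetical. The description above states only that a predetermined threshold on the estimated noise level selects between two sub-networks.

```python
def switch_forward(features_by_scale, estimated_noise, threshold, subnetworks):
    """Route one point's features to a sub-network based on estimated noise.

    Assumption: low noise pairs with the small scale and high noise with the
    large scale. All names here are hypothetical placeholders.
    """
    if estimated_noise > threshold:
        branch, scale = "high_noise", "large"
    else:
        branch, scale = "low_noise", "small"
    # Only the selected sub-network is evaluated.
    return branch, subnetworks[branch](features_by_scale[scale])


# Stand-in sub-networks: real ones would be CNNs that output a normal vector.
subnetworks = {
    "low_noise": lambda feats: ("normal_from_small_scale", len(feats)),
    "high_noise": lambda feats: ("normal_from_large_scale", len(feats)),
}
features_by_scale = {"small": [0.1, 0.2], "large": [0.3, 0.4, 0.5]}

# Hypothetical noise estimate (0.02) above a hypothetical threshold (0.01).
branch, output = switch_forward(features_by_scale, 0.02, 0.01, subnetworks)
print(branch, output)
```

In contrast to this hard-coded rule, Nesti-Net learns the scale selection, so no noise level needs to be supplied during training.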

B Normal estimation performance analysis

In Section 4.3 we report the RMS error metric results for comparison to other methods. The RMS error favors averaging methods. For example, near corners, it will reward a method that estimates an average normal direction rather than a method that estimates the normal of the wrong plane. Therefore, a complementary metric is required to negate this effect. We use the proportion of good points metric (PGP$\alpha$), which computes the percentage of points with an error less than $\alpha$; e.g., PGP10 computes the percentage of points with an angular error of less than 10 degrees. Table 5 reports the results of PGP10 and Table 6 the results of PGP5 for the baseline methods compared to Nesti-Net.
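
The difference between the two metrics can be illustrated with a small sketch. The function names and the toy corner data below are hypothetical and chosen only for illustration. The angle is computed for unoriented normals, so a normal and its negation are treated as equal.

```python
import math


def unoriented_angle_deg(n_est, n_gt):
    """Angle in degrees between two normals, ignoring their sign."""
    dot = sum(a * b for a, b in zip(n_est, n_gt))
    norm = math.sqrt(sum(a * a for a in n_est)) * math.sqrt(sum(b * b for b in n_gt))
    # abs() treats n and -n as equal; min() guards against rounding above 1.
    return math.degrees(math.acos(min(1.0, abs(dot) / norm)))


def angle_errors(estimated, ground_truth):
    return [unoriented_angle_deg(e, g) for e, g in zip(estimated, ground_truth)]


def rms_angle_error(estimated, ground_truth):
    errors = angle_errors(estimated, ground_truth)
    return math.sqrt(sum(err * err for err in errors) / len(errors))


def pgp(estimated, ground_truth, alpha_deg):
    """Percentage of points whose angular error is below alpha_deg."""
    errors = angle_errors(estimated, ground_truth)
    return 100.0 * sum(1 for err in errors if err < alpha_deg) / len(errors)


# Toy corner: two points on a plane with normal z, two on a plane with normal x.
gt = [(0, 0, 1), (0, 0, 1), (1, 0, 0), (1, 0, 0)]
averaging = [(1, 0, 1)] * 4                              # 45 degrees off everywhere
sharp = [(0, 0, 1), (1, 0, 0), (1, 0, 0), (0, 0, 1)]     # exact, but wrong plane twice

for name, est in (("averaging", averaging), ("sharp", sharp)):
    print(f"{name}: RMS={rms_angle_error(est, gt):.1f} PGP10={pgp(est, gt, 10):.0f}%")
# averaging: RMS=45.0 PGP10=0%
# sharp: RMS=63.6 PGP10=50%
```

In this toy case the averaging estimator obtains the lower RMS error even though none of its normals is within 10 degrees of the ground truth, which is the effect that PGP$\alpha$ is meant to expose.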

We show here additional results from Section 4.3. Figure 8 shows normal vectors mapped to RGB color at each point. Figure 10 shows the angular error mapped to a heatmap over the range 0-60. The number above each point cloud is its RMS error. Figure 9 shows the expert (scale) prediction by assigning a color to each expert and visualizing the chosen expert color over the point cloud.
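
The exact mapping from normal vectors to RGB color is not specified here. A common convention, assumed in the sketch below, rescales each component of the unit normal from [-1, 1] to [0, 255]. The function name is hypothetical.

```python
import math


def normal_to_rgb(normal):
    """Map a normal vector to an 8-bit RGB triple.

    Assumed convention (not stated in the paper): each component of the
    unit-length normal is rescaled linearly from [-1, 1] to [0, 255].
    """
    nx, ny, nz = normal
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0.0:
        raise ValueError("cannot color a zero-length normal")
    return tuple(int(round(255 * (c / length + 1.0) / 2.0)) for c in (nx, ny, nz))


print(normal_to_rgb((0.0, 0.0, 1.0)))   # (128, 128, 255)
print(normal_to_rgb((1.0, 0.0, 0.0)))   # (255, 128, 128)
```

Under this convention a normal and its negation receive different colors, so for unoriented normals a consistent sign would have to be chosen before coloring.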

In Section 4.5 we report the normal estimation of Nesti-Net and compare it qualitatively to the PCA results with medium scale. For additional comparison, Figure 11 shows results of PCA for small and large scales. It shows that a small scale produces a noisy output and a large scale over-smooths fine details and corners.

C Time complexity and timing

We subdivide Nesti-Net’s time complexity into its two main stages: MuPS computation and normal estimation. It was shown in [4] that the time complexity of 3DmFV is $O(KT)$, where $K$ is the number of Gaussians and $T$ is the number of points in the point cloud. MuPS computes 3DmFV of $n$ scales (point subsets) containing a maximum of $T_{max}$ points. Therefore, its time complexity is $O(nKT_{max})$. The time complexity of the normal estimation network is constant and proportional to the number of operators in the network. Adding experts to the network increases training time but does not affect test time since only one expert is evaluated during test time. Adding scales, however, affects the scale manager network by introducing additional operators. Nevertheless, the normal estimation time is independent of the number of points.

We report the time performance of our method and its ablations in Figure 12. It includes timing results for single-scale (ss), multi-scale (ms) and mixture-of-experts (Nesti-Net) using $8^{3}$ Gaussians, and $3^{3}$ Gaussians in the ‘light’ versions. Timing is measured as a function of the number of points within each scale. Figure 12 shows that choosing a lower number of Gaussians for the MuPS representation significantly improves speed but introduces a tradeoff with accuracy. For example, the average RMS error of ‘Nesti-Net light’ is $13.5$, which is still superior to all other methods but by a smaller margin.

We also report the timing results of different methods in Figure 13 and compare them to our ‘light’ version. Note that the methods were implemented using different frameworks: PCA and Jet were implemented as part of the CGAL library, PCPNet uses PyTorch, HoughCNN uses CUDA code directly, and Nesti-Net was implemented using TensorFlow. All measurements were performed on the same machine with a quad-core Intel i7-4770 CPU, 16GB RAM, and an Nvidia GTX 1080 GPU. The figure shows that PCA and Jet are the fastest methods (PCA is slightly faster) and that the learning-based approaches are comparable. All of the results are outside the range of real-time performance. The figure also shows that our method’s timing is not as sensitive to the number of points within each scale as the other methods.
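
To make the $O(nKT_{max})$ bound concrete, the sketch below compares an operation-count proxy for the full ($8^{3}$ Gaussians) and ‘light’ ($3^{3}$ Gaussians) configurations. The function name, the number of scales, and the value of $T_{max}$ are hypothetical choices for illustration. The proxy ignores constant factors and the network's own cost, so it does not predict the measured timings.

```python
def mups_cost_proxy(n_scales, n_gaussians, t_max):
    """Operation-count proxy following the O(n * K * T_max) bound."""
    return n_scales * n_gaussians * t_max


# Hypothetical setting: 2 scales with at most 512 points each.
full = mups_cost_proxy(n_scales=2, n_gaussians=8 ** 3, t_max=512)
light = mups_cost_proxy(n_scales=2, n_gaussians=3 ** 3, t_max=512)

print(full, light)                 # 524288 27648
print(f"{full / light:.2f}")       # 18.96
```

Because the bound is linear in $K$, the ratio between the two configurations equals $8^{3}/3^{3} = 512/27$ regardless of the number of scales or points chosen.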

Bibliography

[1] P. Alliez, D. Cohen-Steiner, Y. Tong, and M. Desbrun. Voronoi-based variational reconstruction of unoriented point sets. In Symposium on Geometry Processing, volume 7, pages 39–48, 2007.
[2] N. Amenta and M. Bern. Surface reconstruction by Voronoi filtering. Discrete & Computational Geometry, 22(4):481–504, 1999.
[3] Y. Ben-Shabat, T. Avraham, M. Lindenbaum, and A. Fischer. Graph based over-segmentation methods for 3D point clouds. Computer Vision and Image Understanding, 2018.
[4] Y. Ben-Shabat, M. Lindenbaum, and A. Fischer. 3DmFV: Three-dimensional point cloud classification in real-time using convolutional neural networks. IEEE Robotics and Automation Letters, 3(4):3145–3152, 2018.
[5] A. Boulch and R. Marlet. Fast and robust normal estimation for point clouds with sharp features. In Computer Graphics Forum, volume 31, pages 1765–1774. Wiley Online Library, 2012.
[6] A. Boulch and R. Marlet. Deep learning for robust normal estimation in unstructured point clouds. Computer Graphics Forum, 35(5):281–290, 2016.
[7] F. Cazals and M. Pouget. Estimating differential quantities using polynomial fitting of osculating jets. Computer Aided Geometric Design, 22(2):121–146, 2005.
[8] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, page 10, 2017.