Dilated Point Convolutions: On the Receptive Field Size of Point Convolutions on 3D Point Clouds

Francis Engelmann; Theodora Kontogianni; Bastian Leibe

arXiv:1907.12046·cs.CV·May 26, 2020


TL;DR

This paper introduces Dilated Point Convolutions (DPC), which significantly increase the receptive field size in 3D point cloud processing, leading to improved performance in tasks like segmentation and classification.

Contribution

The paper proposes a novel dilation mechanism for point convolutions that can be integrated into existing networks to enhance receptive field size.

Findings

1. Receptive field size correlates with task performance.

2. DPC significantly enlarges receptive fields in point convolutional networks.

3. Networks with DPC achieve competitive benchmark scores.

Abstract

In this work, we propose Dilated Point Convolutions (DPC). In a thorough ablation study, we show that the receptive field size is directly related to the performance of 3D point cloud processing tasks, including semantic segmentation and object classification. Point convolutions are widely used to efficiently process 3D data representations such as point clouds or graphs. However, we observe that the receptive field size of recent point convolutional networks is inherently limited. Our dilated point convolutions alleviate this issue: they significantly increase the receptive field size of point convolutions. Importantly, our dilation mechanism can easily be integrated into most existing point convolutional networks. To evaluate the resulting network architectures, we visualize the receptive field and report competitive scores on popular point cloud benchmarks.

Tables

Table I: 3D semantic segmentation on S3DIS (Area 5) and ScanNet v2.

Dataset	Method	mIoU	mAcc	oAcc
S3DIS Area 5	PointNet [18]	41.1	49.0	-
	KWYND [6]	52.2	59.1	84.2
	PointCNN [14]	57.3	63.9	85.9
	SPG [12]	58.0	66.5	86.4
	PCNN [25]	58.3	67.0	-
	DPC (Ours)	61.28	68.38	86.78
ScanNet	DPC (Val-set)	59.52	67.21	85.95
ScanNet	DPC (Test-set)	59.2	-	-

Table II: Object classification scores on ModelNet40.

Method	# Points	oAcc	mAcc
PointNet[18]	1k	89.2	86.2
PointNet++(with normals)[20]	5k	91.9	-
Kd-Net[10]	32k	91.8	88.5
EdgeConv[26]	1k	92.2	90.2
SO-Net(with normals)[13]	5k	92.4	90.8
SpiderCNN(with normals)[29]	1k	92.4	-
DPC (Ours) with normals	4k	93.1	91.4

Table III: Ablation study: stacking point convolutions and varying kernel size $k$. Dataset: S3DIS Area 5.

Number of PointConvs	Number of Neighbors $k$	Time per Forward-Pass	Number of Parameters	mIoU	mAcc
3	5	12.10 ms	$402 \cdot 10^{3}$	50.04	57.42
3	10	13.64 ms	$402 \cdot 10^{3}$	50.98	58.16
3	20	17.65 ms	$402 \cdot 10^{3}$	52.25	60.83
5	5	14.53 ms	$625 \cdot 10^{3}$	52.69	58.87
5	10	17.12 ms	$625 \cdot 10^{3}$	52.91	59.57
5	20	23.35 ms	$625 \cdot 10^{3}$	53.27	60.15
7	5	16.99 ms	$880 \cdot 10^{3}$	52.93	59.87
7	10	20.68 ms	$880 \cdot 10^{3}$	53.57	60.92
7	20	29.38 ms	$880 \cdot 10^{3}$	53.93	61.73

Table IV: Ablation study: Dilated Point Convolutions with varying dilation factor $d$. Dataset: S3DIS Area 5.

Number of PointConvs	Number of Neighbors $k$	Time per Forward-Pass	Number of Parameters	Dilation $d$	mIoU	mAcc
7	20	29.38 ms	$880 \cdot 10^{3}$	1	53.93	61.73
7	20	31.57 ms	$880 \cdot 10^{3}$	2	55.83	61.76
7	20	35.36 ms	$880 \cdot 10^{3}$	8	61.28	68.38
7	20	51.65 ms	$880 \cdot 10^{3}$	16	58.79	65.84

Equations

$$(f*g)(p_{i})=\int_{-\infty}^{+\infty}f(p_{j})\odot g(p_{i}-p_{j})\,dp_{j}$$

$$(f*g)(p_{i})\approx\frac{1}{N}\sum_{n=1}^{N}f(p_{n})\odot g(p_{i}-p_{n})$$

$$g(p;\theta)=\mathrm{MLP}(p;\theta)$$

$$(f*g)(p_{i})\approx\frac{1}{|\mathcal{N}_{i}|}\sum_{p_{k}\in\mathcal{N}_{i}}f(p_{k})\odot g(p_{i}-p_{k})$$


Code & Models

Repositories

francisengelmann/DPC (TensorFlow)


Full text

Dilated Point Convolutions: On the Receptive Field Size of Point Convolutions on 3D Point Clouds

Francis Engelmann¹, Theodora Kontogianni¹, Bastian Leibe¹

¹ All authors are with the Computer Vision Group, Visual Computing Institute at RWTH Aachen University in Aachen, Germany.


I Introduction

Recent years have seen rapid progress in 3D scene understanding methods on several tasks, including semantic segmentation [18], object detection [32], and instance segmentation [5]. Recent advances such as point convolutional layers [25, 26, 27], which operate directly on 3D point clouds, have further boosted the field.

In the 2D image domain, analyzing the receptive field is an important tool for diagnosing and comprehending convolutional neural networks (CNN). The receptive field of a neural unit describes the region of the input data that influences its output value. All input data outside of the receptive field does not contribute to the output. Hence, large receptive fields are important since they enable reasoning on a larger input context.

Current successful architectures operating on grid-like data (e.g. images [22, 23, 8]) increase the receptive field implicitly by using deeper network architectures. However, only a few works explicitly study the influence of receptive fields in the domain of 2D image CNNs [15, 17]. So far, no work has analyzed the receptive fields of deep networks operating directly on 3D point clouds. Such a study is particularly challenging, since the theoretical size of receptive fields is difficult to compute due to the non-uniform structure of 3D point clouds. Nevertheless, we argue that the concept of receptive fields is equally important in the 3D domain.

Point convolutional layers [25, 26, 27] are a major driving force behind the success of networks that can directly operate on unstructured data such as 3D point clouds. Furthermore, they can be seen as a generalization of discrete convolutions. While continuous point convolutions operate on data sampled at continuous positions in space, discrete convolutions operate on grid-structured data such as images or voxel-grids, i.e. the data is sampled at discrete positions.

As such, we propose to visualize the receptive fields to analyze different network architectures and we present a thorough ablation study comparing several strategies which increase the receptive field of point convolutions. Specifically, we look at common strategies to increase the receptive field by 1) stacking convolutional layers and 2) using larger kernel sizes. By visually analyzing the extent of the resulting receptive fields, we notice that their influence still remains rather limited. Motivated by these observations, we propose Dilated Point Convolutions as a means to significantly increase the receptive field size of point convolutions.

The paper is structured as follows: We start by discussing current methods for 3D point cloud processing and existing works analyzing receptive fields on discrete convolutions. Then, we review Point Convolutions as an instance of continuous convolutions on 3D point clouds. Next, we describe and visualize well established methods for increasing receptive fields, which leads us to the derivation of Dilated Point Convolutions. Finally, in the experimental section, we compare the aforementioned strategies.

Our contributions are as follows: (1) We evaluate the most commonly used strategies to increase the receptive fields in current methods using point convolutions. (2) We propose to visualize the receptive field of point convolutions to make educated network design choices. (3) From these observations, we derive Dilated Point Convolutions (DPC) as an elegant mechanism to significantly increase the receptive field size. (4) Using DPCs, we are able to report competitive scores on the task of 3D semantic segmentation on S3DIS [1] and ScanNet [4] as well as shape classification on ModelNet40 [28].

II Related Work

2D Projection Representation. Qi et al. [19] and Boulch et al. [2] project 3D point clouds into 2D representations, then apply 2D convolutional networks and finally fuse the results back into 3D space. These types of projections do not make use of the underlying geometric structure, as they only operate on the projected appearance of the point clouds.

3D Volumetric Grid Representation. Maturana and Scherer [16] and Song et al. [28] voxelize point clouds into regular volumetric grids and apply 3D convolutions. These approaches are constrained by the fixed resolution of the 3D grid. Coarse grids lead to loss of detail and fine ones suffer from high memory and computational costs. The use of octrees [21] and kd-trees [10] offers improved grid resolutions. Recently, Graham et al. [7] proposed a speed- and memory-efficient approach for sparse 3D convolutions, which are applied only on occupied voxels. However, voxelized point clouds can still be problematic if adjacent points are far apart, which can hinder information flow.

3D Feature Learning on Point Sets. Numerous methods operate directly on 3D point clouds [18, 25, 26, 27]. They follow up on the seminal work of PointNet [18], which applies point-wise multi-layer perceptrons (MLP) followed by max-pooling over all points to extract a global point cloud descriptor, but fails to capture local structure. Local structure is implicitly considered in 2D images and 3D voxels by using spatial grids. Filters that incorporate the information of the neighboring points in the grid are then learned. Numerous methods rely on similar types of spatial neighborhoods in an unstructured point cloud: Hua et al. [9] compute nearest neighbors on the fly and bin them into spatial cells before using fully convolutional networks. Landrieu and Boussaha [11] compute neighborhoods by over-segmenting 3D point clouds into superpoints. However, the most popular method, used by [6, 14, 20, 26, 27], consists of computing the $k$ nearest neighbors (KNN) of every point to represent its neighborhood. EdgeConv [26] establishes this neighborhood in the feature space, while PointConv [27], PointNet++ [20] and PointCNN [14] use the spatial coordinates. Engelmann et al. [6] use KNN in the feature space and k-means in the world coordinate system to create neighborhoods.

Receptive Field Analysis. Few works systematically study the influence of receptive fields on 2D image CNNs [15, 17]. In general, deeper networks which stack multiple layers of 2D convolutions have proven to work better [22, 23]. Dilated convolutions [30], previously introduced as atrous convolutions [3] and used in 2D image semantic segmentation, make it possible to efficiently enlarge the receptive field of filters to incorporate larger context without increasing the number of model parameters. In this work, we propose a simple yet effective dilation mechanism for 3D point convolutions.

III Approach

In this section, we formally define point convolutions and examine the importance of a large receptive field size in the context of 3D point cloud processing. We revisit existing strategies to increase the receptive field. Then, we propose our main contribution, dilated point convolutions, an elegant and simple technique to significantly increase the receptive field size of point convolutional networks.

III-A Point Convolutions

Point convolutions can be formulated using the general definition of continuous convolutions in a $D$ -dimensional space. Continuous convolutions are defined as

$$(f*g)(p_{i})=\int_{-\infty}^{+\infty}f(p_{j})\odot g(p_{i}-p_{j})\,dp_{j},$$

where $\odot$ is the Hadamard-product of the continuous feature function $f:\mathbb{R}^{D}\rightarrow\mathbb{R}^{F}$ assigning a feature-vector $f(p_{j})\in\mathbb{R}^{F}$ to each position $p_{j}\in\mathbb{R}^{D}$ , and the continuous kernel function $g:\mathbb{R}^{D}\rightarrow\mathbb{R}^{F}$ mapping a relative position to a kernel weight. In the case of 3D point clouds, we have $D=3$ and the feature vector could for example contain the point position, color, and normal such that $f(p)\in\mathbb{R}^{9}$ , see Figure 2. In most practical applications, e.g. reconstructed 3D point clouds, the feature function $f$ is not fully known since only a limited number $N$ of point positions $p_{n}$ are observed or even occupied. Using Monte-Carlo integration, the continuous convolution can then be approximated as

$$(f*g)(p_{i})\approx\frac{1}{N}\sum_{n=1}^{N}f(p_{n})\odot g(p_{i}-p_{n}),$$

where recent methods implement the kernel function $g(\cdot)$ as a learned parametric function based on a multi-layer perceptron (MLP)

$$g(p;\theta)=\mathrm{MLP}(p;\theta),$$

where $p$ is the relative position between two points and $\theta$ is a set of learned parameters. In order to extract high-frequency signals it is important to define localized kernels [31]. In 2D image CNNs, this is implemented by e.g. $3\times 3$ or $5\times 5$ pixel kernels. For point convolutions, this effect is achieved by limiting the cardinality of the local kernel support, i.e. by defining a local neighborhood $\mathcal{N}_{i}$ around each point $p_{i}$

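$$(f*g)(p_{i})\approx\frac{1}{|\mathcal{N}_{i}|}\sum_{p_{k}\in\mathcal{N}_{i}}f(p_{k})\odot g(p_{i}-p_{k}).$$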

The above definition of continuous convolutions is used in Wang et al. [25] and PointConv [27] which additionally proposes to weight the kernel function using the inverse local density to compensate for the non-uniform distribution of point samples. In SpiderCNN, Xu et al. [29] propose to replace the MLP by a combination of step functions and Taylor expansions to capture rich spatial information. A broader interpretation of continuous convolutions is used in EdgeConv [26], where the kernel function $g(\cdot)$ is not only defined over relative positions but also over the difference of learned point features. Independent of the concrete implementation, all previously mentioned methods, including PointCNN [14], rely on $k$ nearest neighbors (KNN) to define a local neighborhood $\mathcal{N}$ resulting in local kernels. Next, after looking at the receptive field size, we use KNN neighborhoods to define dilated point convolutions.
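The neighborhood-restricted approximation can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the brute-force neighbor search, the convention that a point counts as its own neighbor, and the names `point_conv` and `toy_mlp_kernel` (with its fixed weights) are all assumptions made for illustration.

```python
import math


def point_conv(points, features, kernel, k):
    """Approximate (f * g)(p_i) over the k nearest neighbors of each point.

    points:   list of D-dimensional positions p_i
    features: list of F-dimensional feature vectors f(p_i)
    kernel:   callable g mapping a relative position (length D) to F weights
    k:        neighborhood size |N_i| (a point counts as its own neighbor here)
    """
    output = []
    for p_i in points:
        # Brute-force KNN in spatial coordinates.
        order = sorted(range(len(points)), key=lambda j: math.dist(p_i, points[j]))
        neighborhood = order[:k]
        accumulated = [0.0] * len(features[0])
        for j in neighborhood:
            relative = [a - b for a, b in zip(p_i, points[j])]  # p_i - p_k
            weights = kernel(relative)                           # g(p_i - p_k)
            for c, w in enumerate(weights):
                accumulated[c] += features[j][c] * w             # Hadamard product
        output.append([v / len(neighborhood) for v in accumulated])
    return output


def toy_mlp_kernel(relative):
    """Hypothetical one-hidden-layer MLP with fixed weights and F = 2 outputs."""
    s = sum(relative)
    hidden = [max(0.0, 0.5 * s + 0.1), max(0.0, -0.5 * s + 0.1)]  # ReLU units
    return [hidden[0] + hidden[1], hidden[0] - hidden[1]]


points = [(float(i), 0.0, 0.0) for i in range(5)]
features = [[float(i), 1.0] for i in range(5)]

# With a constant kernel the convolution reduces to a neighborhood average.
smoothed = point_conv(points, features, lambda rel: [1.0, 1.0], k=3)
learned_like = point_conv(points, features, toy_mlp_kernel, k=3)
print(smoothed[2])  # [2.0, 1.0]
```

Passing the kernel as a callable keeps the sketch agnostic to how $g$ is parameterized; the only requirement is that it returns one weight per feature channel.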

III-B Receptive Field Size

A large receptive field is directly related to the performance of point convolutional networks (Section IV). Thus our goal is to increase the size of the receptive field. The receptive field (or field of view) of a neural unit within a deep network describes the region of the input point cloud that influences the output of that particular unit. In the context of 3D semantic segmentation, where the task is to assign a semantic label to each point in a given point cloud, the final decision on the label for a particular point is influenced only by those points which lie inside the receptive field of that particular point. All other points outside the receptive field do not contribute to the decision, see Figure 4. It is thus essential to design architectures with receptive fields large enough to cover the necessary context for each point.

A common approach to increase the receptive field size, similar to 2D architectures, consists of stacking multiple (point) convolutional layers. EdgeConv [26] stacks 3 convolutional layers, SpiderCNN [29] uses 4 layers and PCCN [25] uses 8. Here, we compare 3, 5 and 7 layers, see Table III.

Increasing the kernel size of the convolution is another popular technique. In the setup of point convolutions this effect is achieved by selecting a larger number $k$ of nearest neighbors. Note, however, that this does not increase the number of model parameters since the kernel weights are computed over relative point positions using the parametric kernel function $g(\cdot)$ , see Table III. This is in stark contrast to convolutions defined over discrete grid positions (e.g. 2D image CNN) where a larger kernel increases the number of model parameters.
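To make the effect of depth and $k$ on the receptive field concrete, the following sketch counts the input points that can influence one output point after several stacked KNN-based layers. It assumes a toy point cloud of evenly spaced points on a line and a fixed spatial KNN graph shared by all layers; `knn_graph` and `receptive_field` are hypothetical helper names, not part of any published code.

```python
import math


def knn_graph(points, k):
    """Map every point index to the indices of its k nearest neighbors (itself included)."""
    graph = []
    for p in points:
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        graph.append(order[:k])
    return graph


def receptive_field(graph, start, num_layers):
    """Indices of input points that can influence `start` after num_layers convolutions."""
    field = {start}
    for _ in range(num_layers):
        # Each layer lets every point in the field see its own neighborhood.
        field = field | {j for i in field for j in graph[i]}
    return field


points = [(float(i), 0.0, 0.0) for i in range(21)]
sizes = {}
for k in (3, 5):
    graph = knn_graph(points, k)
    for num_layers in (3, 5, 7):
        sizes[(k, num_layers)] = len(receptive_field(graph, 10, num_layers))

print(sizes[(3, 3)], sizes[(3, 7)], sizes[(5, 3)])  # 7 15 13
```

On this toy input the field grows only linearly with the number of layers and with $k$, which is consistent with the observation that both strategies leave the receptive field rather limited.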

III-C Dilated Point Convolutions

Using the previously mentioned approaches, the receptive field size still remains limited, see the top 3 rows in Figure 4. Therefore, we propose dilated point convolutions (DPC) as an elegant yet efficient mechanism to increase the receptive field size. DPCs are identical to point convolutions (PC) except in the way they select neighboring points: while PCs directly use the $k$ nearest neighbors, DPCs first compute the $k\cdot d$ nearest neighbors and then select every $d$-th neighbor, see Figure 2 (right). Note that for $d=1$, DPCs are identical to PCs. The dilation significantly increases the receptive field size (see Figure 4), while the number of parameters remains unchanged. The larger number $k\cdot d$ of neighbors that needs to be computed adds a sublinear computational overhead, see Table IV. Another advantage of DPCs is that they can be added directly, with minimal modifications, to most existing point convolutional networks, provided the local kernel neighborhood $\mathcal{N}$ originates from a nearest neighbor search.
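The neighbor selection rule can be sketched in a few lines of plain Python. This is an illustrative sketch, not the released code: the brute-force search, the inclusion of the query point as its own nearest neighbor, and the choice to start the stride at the nearest candidate are assumptions, and `knn_indices` and `dilated_knn_indices` are hypothetical names.

```python
import math


def knn_indices(points, query_index, k):
    """Indices of the k nearest neighbors of points[query_index], nearest first."""
    q = points[query_index]
    order = sorted(range(len(points)), key=lambda j: math.dist(q, points[j]))
    return order[:k]


def dilated_knn_indices(points, query_index, k, d):
    """Compute the k*d nearest neighbors, then keep every d-th one.

    For d = 1 this is ordinary KNN. If fewer than k*d points exist,
    fewer than k indices are returned.
    """
    candidates = knn_indices(points, query_index, k * d)
    return candidates[::d]


points = [(float(i), 0.0, 0.0) for i in range(10)]
plain = dilated_knn_indices(points, 0, k=3, d=1)
dilated = dilated_knn_indices(points, 0, k=3, d=2)
print(plain)    # [0, 1, 2]
print(dilated)  # [0, 2, 4]
```

Both calls return $k=3$ neighbors, so a kernel evaluated on them has the same cost and parameter count; only the spatial extent of the neighborhood changes.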

IV Experiments

Model Architecture. In all our experiments, we use a deep convolutional model as depicted in Figure 3. The main branch (shown in green) consists of stacked (dilated) point convolutions. The $k$ nearest neighbors (KNN) for each point are computed on-the-fly. The final point features are concatenated with global features obtained by max-pooling over the concatenated point features at different depth-levels.

IV-A 3D Semantic Segmentation

Task and Metrics. The goal is to predict a semantic label for each point in a given point cloud. This task is especially well-suited to analyze the effectiveness of larger receptive fields, since the label of each point is only influenced by points in its receptive field. We adopt the commonly used metrics: mean intersection over union (mIoU), mean class accuracy (mAcc), and overall accuracy (oAcc).
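For reference, the three metrics can be computed from a confusion matrix as in the sketch below. The function name `segmentation_metrics` is hypothetical, and skipping classes that never occur is one common convention rather than necessarily the one used by the benchmarks.

```python
def segmentation_metrics(labels, predictions, num_classes):
    """Return mIoU, mAcc and oAcc for integer class labels (inputs must be non-empty)."""
    # confusion[t][p] counts points of true class t predicted as class p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(labels, predictions):
        confusion[t][p] += 1

    ious, accuracies = [], []
    for c in range(num_classes):
        true_positive = confusion[c][c]
        ground_truth = sum(confusion[c])
        predicted = sum(row[c] for row in confusion)
        union = ground_truth + predicted - true_positive
        if union > 0:          # skip classes absent from labels and predictions
            ious.append(true_positive / union)
        if ground_truth > 0:   # skip classes absent from the labels
            accuracies.append(true_positive / ground_truth)

    correct = sum(confusion[c][c] for c in range(num_classes))
    return {
        "mIoU": sum(ious) / len(ious),
        "mAcc": sum(accuracies) / len(accuracies),
        "oAcc": correct / len(labels),
    }


metrics = segmentation_metrics(
    labels=[0, 0, 1, 1, 2, 2],
    predictions=[0, 1, 1, 1, 2, 0],
    num_classes=3,
)
print(metrics)
```

In this toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mIoU is 0.5, while mAcc and oAcc both equal 2/3.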

Datasets. We evaluate on two datasets: (1) Stanford Large-Scale 3D Indoor Spaces (S3DIS) [1] contains dense 3D point clouds from 6 large-scale indoor areas, consisting of 271 rooms from 3 different buildings. The points are annotated with 13 semantic classes. We use the common train/test split, which trains on all areas except Area 5 which we keep for testing [1, 24, 25]. (2) ScanNet v2 [4] contains 3D scans of a wide variety of indoor scenes, including apartments, hotels, conference rooms and offices. The dataset contains 20 valid semantic classes. We use the public training, validation and test split of 1201, 312 and 100 scans, respectively.

Training Details. We train our networks using the Adam optimizer and exponential-decay learning-rate scheduling. During training we randomly sample 4092 points from crops of 3 m side length. This differs from most concurrent methods, which train on 1 m or 1.5 m crops. Since our model has a much larger receptive field, it can learn to make use of this additional context. In general, small training crops could prevent the network from learning from larger context as soon as the size of the receptive field exceeds the size of the training crops. Points are sampled without replacement and we use zero-padding if there are fewer than 4092 points.
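The sampling step can be sketched as follows. This is a minimal illustration under assumptions: the helper name `sample_fixed_size` is hypothetical, padding is done with all-zero vectors appended after the real points, and a small sample size is used in place of 4092 for readability.

```python
import random


def sample_fixed_size(points, num_samples, rng=None):
    """Return exactly num_samples points: sampled without replacement, or zero-padded."""
    rng = rng or random.Random()
    if len(points) >= num_samples:
        return rng.sample(points, num_samples)
    dim = len(points[0]) if points else 3  # assume 3D when the crop is empty
    padding = [(0.0,) * dim] * (num_samples - len(points))
    return list(points) + padding


rng = random.Random(0)
dense_crop = [(float(i), 0.0, 0.0) for i in range(10)]
sparse_crop = dense_crop[:5]

subsampled = sample_fixed_size(dense_crop, 8, rng)  # 8 distinct points
padded = sample_fixed_size(sparse_crop, 8, rng)     # 5 points + 3 zero vectors
print(len(subsampled), len(padded))  # 8 8
```

A fixed number of points per crop keeps batch shapes constant, which is why some form of padding is needed for sparse crops.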

Results and Discussion

We report scores of our best performing models on the ScanNet v2 dataset [4] and the S3DIS dataset [1] in Table I.

Our dilated point convolutional model is able to outperform other recent KNN-based point convolutional networks by a significant margin on S3DIS, and provides competitive scores on ScanNet, specifically among point convolutional approaches. In Figure 5, we show qualitative results on the ScanNet validation dataset. We highlight wrong predictions in red (see right-most column).

IV-B 3D Object Classification

Dataset. ModelNet40 consists of CAD models that belong to one of 40 different categories. We use the official split of 9843 shapes for training and 2468 for testing, as in [20]. We randomly sample 4,000 points from the 3D model of each object. The input features are the 3D coordinates and the surface normals (6 input channels in total).

Comparison. Table II shows the comparison between our method and prior methods. We report overall classification accuracy (oAcc) and mean classification accuracy (mAcc). Next, we present an ablation study of all model hyper-parameters.

IV-C Ablation Study

We perform an ablation study on the previously introduced mechanisms for increasing the receptive field size. The hyper-parameters that we analyze in particular are the number of point convolutional layers, the nearest neighbors $k$ and the dilation factor $d$ . The main results are presented in Table III and Table IV. The ablation studies are performed on Area 5 of the S3DIS dataset [1]. In the following, we discuss the influence of the individual parameters.

Depth and Number of Neighbors k (Table III).

By increasing the number of convolutional layers, we can build deeper networks. Similar to discrete convolutions, deep point convolutional networks perform better than shallow ones. Equally, the performance increases with the number of neighbors. However, increasing the number of neighbors increases the computational cost, resulting in slower inference times. Furthermore, increasing the number of convolutions leads to additional memory consumption.

Dilation Factor d (Table IV).

Dilated Point Convolutions are an efficient tool to rapidly increase the receptive field of convolutions. Using dilation, the receptive field can be increased significantly (Figure 4) at constant memory requirements and a marginal increment in processing time. The improved performance on the semantic segmentation task shows that indeed a larger receptive field is important. However, the rapidly increasing receptive field resulting in large receptive fields in later layers is also responsible for sparsely sampled neighborhoods in earlier layers. We assume that this makes it harder for the network to learn high-frequency or local features. In future work, it could be interesting to investigate deep convolutional networks using Dilated Point Convolutions with a dilation rate $d$ that increases with the depth of the network. Intuitively, such a network could learn localized signals in the earlier stages and higher level information at later stages.

Model Size. Note that, since the kernel function $g(p)$ is defined over relative point positions $p$ , the number of trainable parameters is independent of the number of neighbors $k$ (and hence the dilation factor $d$ ). As such, increasing the number of neighbors $k$ (or the dilation factor $d$ ) increases the receptive field without increasing the model size.
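A small parameter count makes this concrete. The layer widths and channel counts below are hypothetical and chosen only for illustration; they are not the sizes used in the paper. The point is that neither $k$ nor $d$ appears in the MLP kernel's parameter count, whereas a grid kernel grows with its spatial extent.

```python
def mlp_kernel_params(layer_sizes):
    """Weights plus biases of a fully connected MLP with the given layer widths."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def grid_kernel_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a dense 2D convolution with a square kernel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# Kernel MLP mapping a 3D relative position to 64 weights (hypothetical widths).
point_kernel = mlp_kernel_params([3, 16, 64])  # identical for any k or d
grid_3x3 = grid_kernel_params(3, 64, 64)
grid_5x5 = grid_kernel_params(5, 64, 64)
print(point_kernel, grid_3x3, grid_5x5)  # 1152 36928 102464
```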

V Conclusion

In this work, we reviewed several mechanisms to increase the receptive field size of 3D point convolutions. We analyzed and compared different network architectures based on the receptive field size which we showed to be directly related to the performance of point convolutional networks. Specifically, we have proposed dilated point convolutions as an elegant and efficient technique to significantly increase the receptive field size of point convolutions. As a result, we were able to report solid improvements over well-known baseline methods for 3D semantic segmentation and 3D object classification. More importantly, our dilation mechanism can easily be integrated into most existing point convolutional networks. We hope these insights enable the research community to develop better performing models.

Acknowledgements: This work was supported by the ERC Consolidator Grant DeeViSe (ERC-2017-COG-773161). We thank Mats Steinweg, Dan Jia, Jonas Schult and Alexander Hermans for their valuable feedback.

Bibliography


[1] Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3D Semantic Parsing of Large-Scale Indoor Spaces. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[2] Alexandre Boulch, Joris Guerry, Bertrand Le Saux, and Nicolas Audebert. SnapNet: 3D point cloud semantic labeling with 2D deep segmentation networks. In Computers & Graphics, 2017.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. In International Conference on Learning Representations (ICLR), 2015.
[4] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] Cathrin Elich, Francis Engelmann, Jonas Schult, Theodora Kontogianni, and Bastian Leibe. 3D-BEVIS: Birds-Eye-View Instance Segmentation. In German Conference on Pattern Recognition (GCPR), 2019.
[6] Francis Engelmann, Theodora Kontogianni, Jonas Schult, and Bastian Leibe. Know What Your Neighbors Do: 3D Semantic Segmentation of Point Clouds. In European Conference on Computer Vision Workshop (ECCV'W), 2018.
[7] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.