DuEqNet: Dual-Equivariance Network in Outdoor 3D Object Detection for Autonomous Driving

Xihao Wang; Jiaming Lei; Hai Lan; Arafat Al-Jawari; Xian Wei

arXiv:2302.13577·cs.CV·February 28, 2023


TL;DR

DuEqNet introduces a dual-equivariance framework for outdoor 3D object detection in autonomous driving, enhancing accuracy and robustness against vehicle rotation by extracting local and global equivariant features.

Contribution

The paper proposes DuEqNet, a novel 3D detection network leveraging dual-equivariance at local and global levels, improving detection accuracy under vehicle rotation.

Findings

1. Achieves state-of-the-art performance in 3D detection tasks.
2. Improves orientation accuracy and prediction efficiency.
3. Demonstrates plug-and-play compatibility with existing frameworks.


Tables

Table I: 3D detection mAP (%) and NDS on the nuScenes validation set.

Method	car	peds.	barr.	traf.	truck	bus	trail.	cons.	motor.	bicy.	mAP	NDS
SARPNET [37]	59.9	69.4	38.3	44.6	18.7	19.4	18.0	11.6	29.8	14.2	32.4	48.4
PointPillars [14]	78.7	61.2	41.4	18.9	37.2	49.7	26.2	6.56	20.2	0.85	34.1	49.9
WYSIWYG [38]	80.0	66.9	34.5	27.8	35.8	54.1	28.5	7.50	18.5	0.0	35.4	-
InfoFocus [39]	77.9	63.4	47.8	46.5	31.4	44.8	37.3	10.7	29.0	6.1	39.5	-
SSN [40]	81.0	66.2	49.6	18.7	45.0	53.0	26.1	10.6	41.0	20.5	41.2	54.8
3DSSD [27]	81.2	70.2	47.9	31.1	47.2	61.4	30.5	12.6	36.0	8.63	42.7	56.4
Free-anchor3d [41]	81.2	74.4	52.7	41.4	39.3	48.0	30.9	10.2	43.5	18.0	44.0	55.0
PointPainting [42]	77.9	73.3	60.2	62.4	35.8	36.1	37.3	15.8	41.5	24.1	46.4	58.1
CenterPoint [20]	83.9	77.3	60.1	50.5	50.2	62.0	32.7	10.5	45.4	16.4	48.9	59.6
DuEqNet	84.2	78.9	61.3	56.6	52.4	64.6	32.4	13.0	45.3	16.3	50.5	60.6

Table II: Per-class AOE on the nuScenes validation set.

Method	car	peds.	barr.	truck	bus	trail.	cons.	motor.	bicy.	mAOE (↓)
Free-anchor3d [41]	0.1618	0.3632	0.0550	0.2688	0.2808	0.5437	1.4663	0.6709	0.9576	0.5298
PointPillars [14]	0.1569	0.4265	0.0560	0.2117	0.3295	0.3751	1.5133	0.7792	0.8593	0.5231
SSN [40]	0.1531	0.4103	0.0523	0.1653	0.2022	0.3745	1.2489	0.5336	0.7797	0.4355
CenterPoint [20]	0.1642	0.4108	0.0799	0.1532	0.0791	0.5340	0.9745	0.4279	0.6416	0.3850
DuEqNet	0.1494	0.4013	0.0812	0.1423	0.0475	0.5178	0.8553	0.3622	0.5980	0.3506

Table III: Ablation studies on the nuScenes validation set.

Idx.	$\mathbf{L_{e}}$	$\mathbf{G_{e}}$	NDS	mAP	mAOE
1	✘	✘	59.55	48.90	0.3850
2	✔	✘	59.83	49.28	0.3821
3	✘	✔	60.31	50.17	0.3598
4	✔	✔	60.62	50.49	0.3506

Table IV: Generalization of the dual-equivariance structure on the nuScenes validation set (w.o. = without the structure, w. = with the structure).

Method	Structure	mAP	NDS
PointPillars [14]	w.o.	34.09	48.95
PointPillars [14]	w.	34.59(+0.50)	49.40(+0.45)
SSN [40]	w.o.	41.17	54.56
SSN [40]	w.	42.57(+1.4)	54.77(+0.21)
Free-anchor3d [41]	w.o.	43.96	54.98
Free-anchor3d [41]	w.	45.55(+1.62)	56.07(+1.09)
CenterPoint [20]	w.o.	48.90	59.55
CenterPoint [20]	w.	50.36(+1.46)	60.50(+0.95)

Equations

$$\mathcal{B}=f_{det}\big(\mathbf{G_{e}}(\mathbf{L_{e}}(x^{p_{i}}_{\cdot}))\big) \tag{1}$$

$$m_{ji}^{(l+1)}=f_{update}\Big(m_{ji}^{(l)},\sum_{\zeta\in\mathcal{N}_{i}}f_{agg}\big(m_{\zeta j}^{(l)},d_{\mathrm{RBF}}^{(ji)}\big)\Big) \tag{2}$$

$$f_{update}\big(\varphi_{r}^{(l)}(x^{p_{i}}_{\cdot})\big)=\varphi_{r}^{(l+1)}\big(f_{update}(x^{p_{i}}_{\cdot})\big) \tag{3}$$

$$\mathbf{G_{e}}(x)=\mathrm{ReLU}\big(BN_{e}(\Psi_{e}(x))\big) \tag{4}$$

$$[\Psi\star x](t,r)=\sum_{p\in\mathbb{Z}^{2}}\Psi\big(r^{-1}(p-t)\big)\,x(p) \tag{5}$$

$$(k*f)(\hat{x})=\Big(\mathbb{L}_{r}^{S^{1}\to\mathbb{L}_{2}(\mathbb{R}^{2})}\,\mathbb{L}_{r}^{\mathbb{R}^{2}\to\mathbb{L}_{2}(\mathbb{R}^{2})}\,k,\;f\Big)_{\mathbb{L}_{2}(\mathbb{R}^{2})} \tag{6}$$

$$[(t,r)\,y](p,s)=y\big(r^{-1}(p-t),\,r^{-1}s\big) \tag{7}$$

$$\begin{aligned}
[\Psi\star(\mathcal{R}x)](g)&\overset{y=\mathcal{R}y}{=}\sum_{y\in\mathbb{Z}^{2}}\sum_{k}x_{k}(y)\,\Psi_{k}\big(g^{-1}\mathcal{R}y\big)\\
&=\sum_{y\in\mathbb{Z}^{2}}\sum_{k}x_{k}(y)\,\Psi_{k}\big((\mathcal{R}^{-1}g)^{-1}y\big)\\
&=[\mathcal{R}(\Psi\star x)](g)
\end{aligned} \tag{8}$$

$$\mathrm{NDS}=\frac{1}{10}\Big[5\,\mathrm{mAP}+\sum_{\mathrm{mTP}\in\mathbb{TP}}\big(1-\min(1,\mathrm{mTP})\big)\Big] \tag{9}$$


Taxonomy

Topics: Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety · Visual Attention and Saliency Detection

Methods: Convolution

Full text

DuEqNet: Dual-Equivariance Network in Outdoor 3D Object Detection for Autonomous Driving

Xihao Wang1∗, Jiaming Lei2∗, Hai Lan2, Arafat Al-Jawari2, Xian Wei3†

1 Technical University of Munich, [email protected]
2 Fujian Institute of Research on the Structure of Matter, Chinese Academy of Sciences, [email protected], [email protected], [email protected]
3 East China Normal University, [email protected]

∗ Equal technical contribution. † Corresponding author.

Abstract

Outdoor 3D object detection plays an essential role in the environment perception of autonomous driving. In complicated traffic situations, precise object recognition provides indispensable information for prediction and planning in the dynamic system, improving self-driving safety and reliability. However, as the vehicle turns, the constant rotation of the surrounding scene poses a challenge for perception systems. Yet most existing methods have not focused on alleviating the loss of detection accuracy caused by the vehicle’s rotation, especially in outdoor 3D detection. In this paper, we propose DuEqNet, which is the first to introduce the concept of equivariance into a 3D object detection network by leveraging a hierarchical embedded framework. The dual-equivariance of our model extracts equivariant features at both the local and global levels. For the local feature, we use a graph-based strategy to guarantee the equivariance of the features in point cloud pillars. For the global feature, group equivariant convolution layers aggregate the local features to achieve global equivariance. In the experiments, we evaluate our approach against different baselines on 3D object detection tasks and obtain state-of-the-art performance. According to the results, our model achieves higher orientation accuracy and better prediction efficiency. Moreover, our dual-equivariance strategy exhibits satisfactory plug-and-play ability on various popular object detection frameworks, improving their performance.

I INTRODUCTION

In recent years, autonomous driving techniques [1, 2, 3] have made significant progress across many scenarios, such as self-driving [4], robotaxis [5], and delivery robots [6]. As the core function of self-driving, precise 3D perception underpins the safety and reliability of autonomous driving systems. The perception system generally receives multi-modal data from a complicated real-world environment, including images from cameras, point clouds from LiDAR scanners, and high-definition maps [7]. Given this rich input, 3D object detection is one of the most important tasks for helping the automotive agent understand its surroundings comprehensively. Several factors have therefore been explored to improve perception performance in 3D object detection, such as object shapes [8, 9], sizes [10, 11] and locations [12, 13].

With the development of 3D object detection techniques, more and more advanced methods are being adopted in autonomous driving systems. However, in an actual driving situation, the vehicle constantly changes its heading, which poses a challenge for perception systems in outdoor scenarios. As the orientation of the scene varies, object detection accuracy deteriorates notably. As illustrated in Figure 1, without any dedicated treatment, the bounding boxes are of poor quality while the vehicle rotates. Although object orientation is critical for 3D detection in outdoor scenarios, existing methods have not focused on improving orientation prediction. Rotation data augmentation is an indirect way to obtain better orientation estimates [15]. However, its expensive computation and its unclear effect on capturing orientation-related features indicate that rotation data augmentation alone is not sufficient for better object orientation prediction.

Therefore, inaccurate orientation prediction caused by constant rotation challenges existing 3D object detection methods. To address this problem, we propose a dual-equivariance 3D object detection network called DuEqNet. We introduce a novel hierarchical embedded framework to extract equivariant features at the local and global levels. In particular, inspired by the embedding strategy of a directional message passing network [16], we first propose a novel paradigm to extract local equivariance in pillars, which refers to pillar-level rotation equivariance. Then, we introduce a lifting layer to generate global equivariance in the pseudo feature map, which refers to BEV (Bird’s Eye View)-level rotation equivariance. The experimental results show that DuEqNet can be applied to an autonomous driving multimodal dataset [17] and achieves state-of-the-art (SOTA) results. Experiments on the nuScenes dataset demonstrate that our proposed algorithm achieves a mean Average Orientation Error (mAOE) of $0.3506$ and a mean Average Precision (mAP) of $50.49\%$ , which is better than the 3D object detection algorithms we compare against.

The main contributions of this paper include:

• To the best of our knowledge, we are the first to introduce the concept of dual-equivariance. It is an efficient approach that extracts the orientation-related features from the received perception information.

• Based on the theory of equivariance, we develop a dual-equivariance framework for 3D object detection, named DuEqNet, which leverages a hierarchical embedded framework to extract equivariant features at both the local and global levels.

• In experiments, we achieve SOTA results in different object detection tasks on the nuScenes dataset. Besides, our strategy exhibits satisfactory plug-and-play ability on various popular object detection frameworks. Moreover, the visualization of target predictions indicates that our method not only improves orientation accuracy but also produces fewer invalid predictions.

II RELATED WORK

II-A 3D object detection algorithm

In this section, we focus on LiDAR-based 3D object detection algorithms, which directly take 3D point cloud data as input and predict 3D oriented bounding boxes (OBBs) to represent the objects in a scene [18, 14, 19, 20]. Most inherit the bottom-up design of deep learning models, in which a backbone network extracts regional feature maps from the input data and a subsequent detection head proposes candidate OBBs. Compared with images, point clouds are non-grid data that require extra processing for feature extraction. VoxelNet [21] pioneered a voxelization approach that builds sparse voxel grids in which the points are encoded by a voxel feature encoding layer. This idea has inspired a series of follow-up studies [22, 23]. To speed up voxelization, Lang et al. proposed PointPillars, in which the grid is limited to a single voxel along the vertical axis [14]. Beyond voxelization, using graphs to encode point clouds is another promising methodology for feature extraction. Shi et al. designed Point-GNN, which encodes the point cloud as vertices of a graph to predict objects [24]. SVGA-Net constructs a local complete graph and a global KNN graph that serve as an attention mechanism for enhancing the extracted features [25].

In another line of methods, the detection head is directly inspired by 2D detectors, and a hierarchical architecture is adopted to enhance its performance, since 3D objects are relatively small within the whole detection range. SECOND [26] refined the SSD [27] detection head and became the mainstream backbone in 3D object detection; later, CenterPoint [20], which is based on CenterNet [28], achieved state-of-the-art performance in the 3D object detection task on both the nuScenes and Waymo [29] datasets.

II-B Equivariant Network

As a pioneering work, the concept of an equivariant network was proposed in group equivariant convolutional neural networks (G-CNN) by Cohen and Welling [30]. Compared with an ordinary ConvNet, G-CNN guarantees the rotation equivariance of the extracted features under the group operation through a higher degree of weight sharing. Therefore, the expressivity of the equivariant network can be improved without significantly increasing the number of parameters. Building on group equivariance, follow-up works extended the equivariance of networks from discrete to continuous groups. Li et al. [31] applied a rotation transformation to the convolution kernel and proposed three novel convolutional layers. Finzi et al. [32] proposed LieConv, which is theoretically equivariant to transformations from any Lie group, and constructed an equivariant CNN with better generalization.

Moreover, the concept of an equivariant network is also applied in graph neural networks (GNNs), which have demonstrated their prominence in dealing with unstructured data, such as molecules and point clouds. These data implicitly contain geometric graphs that exhibit symmetries under translations, rotations, and other transformations. However, GNNs are always permutation equivariant but not inherently geometrically equivariant [33]. To attain geometric equivariance, many works modify the parametric functions in GNNs, including message passing and local aggregation. TFN [34] realizes translation and rotation equivariance on the group SE(3). To alleviate the heavy computational burden, DimeNet [35] embeds the direction and distance between atoms as features to obtain rotational equivariance with a simple model structure.

Applications of equivariant networks have so far been largely confined to molecules, physical dynamics simulation, and indoor point cloud classification, and little work has established a link between outdoor point cloud object detection and equivariant networks because of the massive scale of the data. For 3D object detection tasks, the points in each pillar imply a graph containing geometric information (such as orientation-related information), but most approaches do not explicitly exploit object rotation equivariance in their models. To tackle this problem, we propose DuEqNet, which focuses on attaching object-level equivariance to the object detection model by designing a simple but effective equivariant feature through equivariant pillar encoding with a GNN.

III Proposed Method

In this section, we first give a brief overview of our network and the preliminaries on equivariant representation. Then, we describe how the local and global equivariant features are extracted. Finally, we explain how the two levels of features are joined.

III-A Overview and Preliminary

To cope with the challenge induced by scene rotation, we propose a hierarchical embedded framework to extract equivariant features. The architecture of the network is presented in Figure 2 and can be represented as:

$$\mathcal{B}=f_{det}\big(\mathbf{G_{e}}(\mathbf{L_{e}}(x^{p_{i}}_{\cdot}))\big), \tag{1}$$

where $x^{p_{i}}_{\cdot}\in\mathbb{R}^{\alpha}$ denotes each point within pillar $p_{i}\in\mathbb{R}^{\beta}$ . In detail, $\alpha$ is the dimension of the initial feature of an input point, and $\beta$ is the dimension of the embedded pillar vector. $\mathbf{L_{e}}$ and $\mathbf{G_{e}}$ represent the local and global equivariance functions, respectively. $f_{det}$ is a detection head that detects and regresses the 3D bounding boxes $\mathcal{B}$ . As for the equivariant features, following the definition of group equivariance [30], a function $f:X\rightarrow Y$ , where $X$ and $Y$ are two homomorphism spaces, is equivariant if $f(\varphi^{X}_{g}(x))=\varphi^{Y}_{g}(f(x))$ , where $\varphi^{X}_{g}$ and $\varphi^{Y}_{g}$ denote the action of the group element $g$ on the input space and the output space, respectively [35]. The pillar-level and BEV-level feature operations can be regarded as group actions between two homomorphism spaces.
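As a concrete illustration of the condition $f(\varphi^{X}_{g}(x))=\varphi^{Y}_{g}(f(x))$ , the following minimal Python sketch checks it numerically for a toy function that is not part of DuEqNet: the centroid of a 2D point set, with $g$ a planar rotation acting on both the input and the output. The helper names `rotate` and `centroid` are hypothetical and chosen only for this example.

```python
import math

def rotate(point, angle):
    """Rotate a 2D point counterclockwise by `angle` radians about the origin."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def centroid(points):
    """Toy function f (an assumption for illustration): mean position of a point set."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

points = [(1.0, 0.0), (2.0, 1.0), (0.5, 3.0)]
angle = math.pi / 2  # a 90-degree rotation

lhs = centroid([rotate(p, angle) for p in points])  # f(phi_g(x)): rotate, then apply f
rhs = rotate(centroid(points), angle)               # phi_g(f(x)): apply f, then rotate
is_equivariant = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(lhs, rhs))
```

Both orders of operation give the same result here because the centroid is linear in the point coordinates; a function that depended on absolute axis directions would fail the same check.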

III-B Local Equivariance Feature Extraction

To capture the local geometric information, we use pillars as our data representation for 3D object detection. Compared with voxels, pillars are unbounded in the vertical direction. Inspired by the message passing strategy [36] in graph networks, we treat each pillar as a subgraph of the input data. Each 3D input point in the pillar then corresponds to a node $u\in\mathcal{V}$ of the complete graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ , where $\mathcal{V}$ is the vertex set and $\mathcal{E}$ is the edge set. As the original 3D input does not satisfy rotation equivariance, we embed each node $u_{i}$ and its neighbor $u_{j}$ with the same learned filter $W(u_{j}-u_{i})$ , which depends on the distance between neighboring inputs. Hence, the update of the passing message $m_{ji}$ between a sampled input and its neighborhood $\mathcal{N}_{i}$ with $\sigma$ nodes is defined as:

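$$m_{ji}^{(l+1)}=f_{update}\Big(m_{ji}^{(l)},\sum_{\zeta\in\mathcal{N}_{i}}f_{agg}\big(m_{\zeta j}^{(l)},d_{\mathrm{RBF}}^{(ji)}\big)\Big), \tag{2}$$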

where $d_{\mathrm{RBF}}^{(ji)}\in\mathbb{R}^{\alpha}$ denotes the pairwise point distance represented with a radial basis function, $m_{\zeta j}=W(u_{\zeta}-u_{j})=W(||u_{\zeta}-u_{j}||_{2})\in\mathbb{R}^{\sigma-1}$ , and $f_{agg}$ denotes the linear aggregation function. According to the definition of equivariance, the message update function satisfies:

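$$f_{update}\big(\varphi_{r}^{(l)}(x^{p_{i}}_{\cdot})\big)=\varphi_{r}^{(l+1)}\big(f_{update}(x^{p_{i}}_{\cdot})\big), \tag{3}$$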

where $r$ denotes the rotation operation. Thus, we consider that the message passing paradigm $\mathbf{L_{e}}$ in the pillar satisfies the equivariant constraint, which means that the local equivariance feature is extracted in the corresponding pillar.
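The reason distance-based messages are compatible with rotation can be sketched in a few lines of Python: pairwise distances inside a pillar do not change when the whole pillar is rotated, so any radial-basis embedding of them is unchanged as well. The sketch below is an illustration under assumptions, not the authors' implementation: the Gaussian form of the basis, the `centers` and `gamma` values, and the function names are hypothetical, and the learned filter $W$ is omitted.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the vertical (z) axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def gaussian_rbf(distance, centers, gamma=10.0):
    """Expand a scalar distance on Gaussian radial basis functions (assumed form)."""
    return [math.exp(-gamma * (distance - mu) ** 2) for mu in centers]

def edge_embeddings(points, centers):
    """RBF embedding of the distance for every ordered pair (j, i) in a complete graph."""
    return {
        (j, i): gaussian_rbf(math.dist(pj, pi), centers)
        for j, pj in enumerate(points)
        for i, pi in enumerate(points)
        if i != j
    }

pillar_points = [(0.10, 0.20, -1.0), (0.15, 0.05, 0.3), (0.00, 0.18, 1.2)]  # toy pillar
centers = [0.0, 0.5, 1.0, 1.5, 2.0]                                        # hypothetical

before = edge_embeddings(pillar_points, centers)
after = edge_embeddings([rotate_z(p, math.radians(37.0)) for p in pillar_points], centers)
max_diff = max(abs(a - b) for key in before for a, b in zip(before[key], after[key]))
```

In this toy setting `max_diff` is at the level of floating-point round-off for any rotation angle, which is the invariant special case of the equivariance constraint in which the group acts trivially on the output.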

III-C Global Equivariance Feature Extraction

After extracting the local equivariance features, we construct the module $\mathbf{G_{e}}$ to realize global equivariance between pillars. According to the concept of group equivariant convolution, $\mathbf{G_{e}}$ can be shown as:

$$\mathbf{G_{e}}(x)=\mathrm{ReLU}\big(BN_{e}(\Psi_{e}(x))\big), \tag{4}$$

where $\Psi_{e}$ includes the space lifting function $\mathcal{L}$ and the group convolution $\mathcal{C}$ (covering convolution and transposed convolution), $x$ is the pseudo feature map aggregated via local feature extraction, and $BN_{e}$ is a batch normalization that satisfies the definition of a group. We explain the reason for applying this specific batch normalization in Section III-D.

To achieve global rotation equivariance, we build the space lifting function $\mathcal{L}$ . It maps the space $X$ of pseudo features to a larger homomorphism space $Y$ . The lifting convolution $\Psi\star x$ is defined as follows:

$$[\Psi\star x](t,r)=\sum_{p\in\mathbb{Z}^{2}}\Psi\big(r^{-1}(p-t)\big)\,x(p), \tag{5}$$

where $x(p)$ denotes the value of pixel $p$ in the pseudo feature map $x$ , $\Psi$ is the group convolution kernel in space $X$ , and $(t,r)$ denotes an element of the group $P_{4}$ . The group $P_{4}$ is the symmetry group that collects all combinations of translations $t$ and 90-degree rotations $r$ about any center of rotation in a square grid [30]. Unlike the regular convolution operator, the group convolution operates on the coupled space $\mathbb{R}^{2}\times S^{1}$ of translations in the 2D plane $\mathbb{R}^{2}$ and rotations in the 1D spherical space $S^{1}$ . Thus, the representation of the group convolution can be expressed as

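$$(k*f)(\hat{x})=\Big(\mathbb{L}_{r}^{S^{1}\to\mathbb{L}_{2}(\mathbb{R}^{2})}\,\mathbb{L}_{r}^{\mathbb{R}^{2}\to\mathbb{L}_{2}(\mathbb{R}^{2})}\,k,\;f\Big)_{\mathbb{L}_{2}(\mathbb{R}^{2})}, \tag{6}$$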

where $\mathbb{L}$ is the left representation, $k$ is the kernel of the convolution, and $\hat{x}=(t,r)$ is a group element of $P_{4}$ . In road traffic scenes, most vehicle turns are 90-degree rotations. Consequently, we restrict the rotation operator $r$ to the cyclic group $C_{4}$ . By the definition of equivariance, the outputs of the functions $\mathcal{L}$ and $\mathcal{C}$ naturally satisfy:

$$[(t,r)\,y](p,s)=y\big(r^{-1}(p-t),\,r^{-1}s\big), \tag{7}$$

where $s$ denotes the additional rotation element of the input feature. For a concise representation, let ${g}=(t,r)\in P_{4}$ , then the rotation equivariance of this layer is presented as follows:

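$$\begin{aligned}
[\Psi\star(\mathcal{R}x)](g)&\overset{y=\mathcal{R}y}{=}\sum_{y\in\mathbb{Z}^{2}}\sum_{k}x_{k}(y)\,\Psi_{k}\big(g^{-1}\mathcal{R}y\big)\\
&=\sum_{y\in\mathbb{Z}^{2}}\sum_{k}x_{k}(y)\,\Psi_{k}\big((\mathcal{R}^{-1}g)^{-1}y\big)\\
&=[\mathcal{R}(\Psi\star x)](g),
\end{aligned} \tag{8}$$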

where $\mathcal{R}$ denotes rotation, and $x$ and $\Psi$ have the same definition as above. Based on the concept of group equivariance, we construct equivariant convolution and transposed convolution to extract rich features following function $\mathcal{L}$ .
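The rotation equivariance of a $P_{4}$ lifting convolution can be checked numerically with a small NumPy sketch. This is a simplified illustration under stated assumptions rather than the DuEqNet layer: it uses a single channel, a random kernel, and circular padding (chosen so that the equality holds exactly on a finite grid), and the function names `correlate2d_circular` and `p4_lifting_conv` are hypothetical.

```python
import numpy as np

def correlate2d_circular(x, psi):
    """Cross-correlate a square map x (H, W) with an odd-sized kernel psi (k, k),
    wrapping around at the borders (circular padding, an assumption of this sketch)."""
    k = psi.shape[0]
    c = k // 2
    out = np.zeros(x.shape, dtype=float)
    for i in range(k):
        for j in range(k):
            # Shifting by (c - i, c - j) places x[p + (i - c, j - c)] at position p.
            out += psi[i, j] * np.roll(x, shift=(c - i, c - j), axis=(0, 1))
    return out

def p4_lifting_conv(x, psi):
    """Lift x from the plane to P4: one response map per 90-degree rotation of psi."""
    return np.stack([correlate2d_circular(x, np.rot90(psi, r)) for r in range(4)])

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))    # toy single-channel pseudo feature map
psi = rng.normal(size=(3, 3))  # toy kernel

y = p4_lifting_conv(x, psi)                # shape (4, 8, 8)
y_rot = p4_lifting_conv(np.rot90(x), psi)  # response to the rotated input

# Rotating the input rotates every response map and shifts the rotation axis by one step.
expected = np.roll(np.rot90(y, 1, axes=(1, 2)), 1, axis=0)
max_error = float(np.abs(y_rot - expected).max())
```

The combination of a spatial rotation with a cyclic shift along the rotation axis is the action $[(t,r)\,y](p,s)=y(r^{-1}(p-t),r^{-1}s)$ specialized to a pure 90-degree rotation, so the orientation information is moved between the four rotation channels rather than lost.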

III-D Joint of Dual-Equivariance Framework

Equivariant batch normalization $BN_{e}$ : Batch Normalization (BN) can stabilize the intermediate layers, accelerate the convergence of the network, and suppress overfitting to some extent. To embed BN into the backbone, we implement $BN_{e}$ based on the definition of a group. BN only changes the distribution of the output data, not the space in which the output lies. By combining and stacking the above functions, we can build a rotationally equivariant feature extraction module (Figure 3). It consists of lifting $\mathcal{L}$ , convolution $\mathcal{C}$ (convolution and transposed convolution), activation and batch normalization $BN_{e}$ .

Detection head: Considering the diversity of object orientations in 3D scenes, many of which are not aligned with the coordinate axes, we adopt a center-based detection head [20] to represent objects better and predict their orientation more precisely. In our model, each object is described by a point, from which orientation, size, velocity and other attributes are regressed.
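One common way to make batch normalization compatible with a $P_{4}$ feature map is to compute a single mean and variance per channel, shared across the four rotation channels as well as across batch and space. The NumPy sketch below illustrates that idea and checks that it commutes with a 90-degree rotation. It is an assumption about how such a layer could look, not a description of the authors' $BN_{e}$ : the tensor layout `(N, C, 4, H, W)`, the omission of learned scale and shift, and the function names are all hypothetical.

```python
import numpy as np

def group_batch_norm(y, eps=1e-5):
    """Normalize a P4 feature map y of shape (N, C, 4, H, W) with one mean and
    variance per channel C, shared over batch, rotation, and spatial axes."""
    mean = y.mean(axis=(0, 2, 3, 4), keepdims=True)
    var = y.var(axis=(0, 2, 3, 4), keepdims=True)
    return (y - mean) / np.sqrt(var + eps)

def rotate_p4(y):
    """Action of a 90-degree rotation on a P4 feature map:
    rotate the spatial axes and cyclically shift the rotation axis."""
    return np.roll(np.rot90(y, 1, axes=(3, 4)), 1, axis=2)

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=3.0, size=(2, 3, 4, 8, 8))  # toy feature map

# Normalizing then rotating should equal rotating then normalizing.
bn_error = float(np.abs(group_batch_norm(rotate_p4(y)) - rotate_p4(group_batch_norm(y))).max())
```

The check passes because the rotation only permutes entries within each channel, which leaves the shared statistics unchanged; separate statistics per rotation channel would not have this property in general.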

IV Experiments and Analysis of Results

IV-A Dataset and Evaluation Metrics

We conducted all the experiments in this paper on the nuScenes dataset. As one of the most popular large public datasets in autonomous driving, nuScenes is collected with a 32-beam LiDAR mounted at the top center of the vehicle. It contains a total of $40000$ key frames collected from different scene locations, including $28130$ training samples and $6019$ validation samples, with 23 categories of labelled objects such as cars, pedestrians, and cyclists. For the 3D object detection task, objects of 10 classes need to be detected.

For 3D object detection, the most commonly used evaluation metric is the average precision (AP), which in nuScenes is based on the center distance between the prediction and the ground truth in the bird’s-eye view.

Besides AP, nuScenes also measures a series of True Positive metrics (TP metrics) to assess the center distance, size, orientation, velocity and classification deviation between prediction and ground truth. Note that the mean Average Orientation Error (mAOE) is one of the most important metrics in our experiments, since it validates the effectiveness of our dual-equivariance structure. Moreover, to take the mAP and all TP metrics into account, nuScenes proposes the nuScenes detection score (NDS), given in Equation (9):

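$$\mathrm{NDS}=\frac{1}{10}\Big[5\,\mathrm{mAP}+\sum_{\mathrm{mTP}\in\mathbb{TP}}\big(1-\min(1,\mathrm{mTP})\big)\Big]. \tag{9}$$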
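The NDS weights mAP by five and adds, for each of the five mean TP errors, a score of one minus the error clipped at one, then divides by ten. A minimal Python sketch of this computation follows. The function name and the input values are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def nuscenes_detection_score(mean_ap, mean_tp_errors):
    """NDS = (5 * mAP + sum over TP metrics of (1 - min(1, mTP))) / 10.

    mean_ap: mAP as a fraction in [0, 1].
    mean_tp_errors: dict with the five mean TP errors (lower is better).
    """
    tp_scores = [1.0 - min(1.0, err) for err in mean_tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0

# Hypothetical inputs (not from the paper).
hypothetical_errors = {
    "mATE": 0.30,  # translation
    "mASE": 0.25,  # scale
    "mAOE": 0.35,  # orientation
    "mAVE": 1.20,  # velocity; exceeds 1, so it is clipped and contributes 0
    "mAAE": 0.20,  # attribute
}
score = nuscenes_detection_score(0.50, hypothetical_errors)
# (5 * 0.50 + 0.70 + 0.75 + 0.65 + 0.00 + 0.80) / 10 = 0.54
```

Because each TP error enters with weight 1/10, lowering mAOE by 0.03 raises the NDS by 0.003, that is, 0.3 points on the 0 to 100 scale used in the tables, with everything else held fixed.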

IV-B Setting in Training and Inference

As described in Section IV-A, this paper trains a 10-class detection network and evaluates its AP, TP metrics and NDS. We focus on mAOE to evaluate and analyze the effectiveness of our proposed algorithm in improving the accuracy of orientation prediction. In all our experiments, we set the detection range of the LiDAR point cloud to [-51.2m, 51.2m] for the X and Y axes and [-5m, 3m] for the Z axis. Moreover, following the official pre-processing baseline for point cloud data, we aggregate the key frame with $9$ consecutive frames before feeding it into the network, to obtain a denser sample.

We perform data augmentation during training, including random rotation, random scaling and random flipping along the X or Y axis. In validation and testing, no data augmentation is performed. For a fair comparison, all experiments were conducted on the whole dataset, training for 24 epochs on the same machine with eight GeForce RTX 3090 GPUs. The networks were optimized with the AdamW algorithm with a weight decay of 0.01.
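The detection range above determines which points are kept and how they are binned into pillars. The following Python sketch shows one way to do this. Only the X, Y and Z ranges come from the text; the pillar size of 0.2 m is a hypothetical value chosen for illustration (the paper does not state it in this section), as are the function names.

```python
import math

X_RANGE = (-51.2, 51.2)  # metres, from the text
Y_RANGE = (-51.2, 51.2)  # metres, from the text
Z_RANGE = (-5.0, 3.0)    # metres, from the text
PILLAR_SIZE = 0.2        # metres, hypothetical: not specified in the text

def pillar_index(point):
    """Return the (row, col) pillar of a point, or None if it is out of range."""
    x, y, z = point[:3]
    in_range = (
        X_RANGE[0] <= x < X_RANGE[1]
        and Y_RANGE[0] <= y < Y_RANGE[1]
        and Z_RANGE[0] <= z < Z_RANGE[1]
    )
    if not in_range:
        return None
    col = math.floor((x - X_RANGE[0]) / PILLAR_SIZE)
    row = math.floor((y - Y_RANGE[0]) / PILLAR_SIZE)
    return row, col

def group_into_pillars(points):
    """Group in-range points by pillar; pillars span the full Z range."""
    pillars = {}
    for p in points:
        idx = pillar_index(p)
        if idx is not None:
            pillars.setdefault(idx, []).append(p)
    return pillars

toy_points = [
    (0.05, 0.05, 0.0),
    (0.10, 0.15, -1.0),   # same pillar as the first point
    (-51.2, -51.2, 0.0),  # corner of the range
    (60.0, 0.0, 0.0),     # outside the X range, dropped
    (0.0, 0.0, 4.0),      # outside the Z range, dropped
]
pillars = group_into_pillars(toy_points)
```

Each resulting pillar holds the points that the local equivariance module treats as one subgraph.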

IV-C Results and analysis

In this section, we present the results of DuEqNet and compare them with other popular methods. The following tables use these abbreviations: pedestrian (peds.), barrier (barr.), traffic cone (traf.), trailer (trail.), construction vehicle (cons.), motorcycle (motor.) and bicycle (bicy.).

Precision: Firstly, we compare the precision of our network with other popular detection methods, including LiDAR-based and fusion-based methods. As shown in Table I, DuEqNet outperforms the other methods in terms of mAP and NDS. Compared with the second-best method, CenterPoint, our method shows an absolute improvement of $1.6\%$ and $1.0\%$ in mAP and NDS. This means that DuEqNet not only locates potential objects precisely but also predicts the attributes of objects more comprehensively. Furthermore, DuEqNet obtains the highest AP on classes such as cars and pedestrians. For traffic cones, trailers, construction vehicles, and bicycles, PointPainting obtains the best APs, which exceed those of DuEqNet. One plausible reason is that PointPainting takes both LiDAR point cloud data and camera images as input, while DuEqNet only uses point cloud data. Images offer rich color and texture information, improving the overall performance of the fusion model. DuEqNet obtains $45.3\%$ AP for motorcycles, which is just $0.1\%$ lower than CenterPoint, so it is comparable to CenterPoint at detecting motorcycles.

Orientation: In addition, to verify our method for orientation prediction, we evaluate the AOE of each class and the mAOE in Table II. DuEqNet achieves the lowest mAOE and performs best in most classes. More specifically, it surpasses CenterPoint, the most popular method, with a relative improvement of $8.9\%$ . The significant improvement in AOE is supported by our visualization results in Figure 4(a), where our method shows better orientation prediction.

Ablation study: We provide ablation studies to assess the effectiveness of the proposed dual-equivariance structure. Recall that our dual-equivariance structure consists of two parts: local and global equivariance feature extraction (denoted $\mathbf{L_{e}}$ and $\mathbf{G_{e}}$ in the tables, respectively). For the baseline (marked as Idx. 1 in Table III), we adopt the same pillar encoding as PointPillars [14] and replace global equivariance feature extraction with the popular convolution backbone [20], [14].

In Table III, with the dual-equivariance structure, DuEqNet (Idx. 4) achieves the best NDS, mAP and mAOE. Concretely, compared with the baseline, we attain absolute gains of $1.07$ and $1.59$ in NDS and mAP, and an $8.94\%$ reduction in orientation error. Note that, comparing the baseline with method 3, the global equivariance feature extraction substantially lowers the mAOE from $0.3850$ to $0.3598$ . It has a stronger effect on orientation than the local equivariance feature extraction. This can be attributed to the size of objects: most objects occupy several pillars in the point cloud. The global equivariance feature extraction can capture the relationship between pillars, which is beneficial for orientation regression.
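The gains quoted above follow directly from the rows of Table III. The short Python sketch below recomputes them; the dictionary layout and the helper name `relative_reduction` are hypothetical, while all numbers are copied from the table.

```python
# Rows of Table III (nuScenes validation set).
baseline = {"NDS": 59.55, "mAP": 48.90, "mAOE": 0.3850}     # Idx. 1: no L_e, no G_e
local_only = {"NDS": 59.83, "mAP": 49.28, "mAOE": 0.3821}   # Idx. 2: L_e only
global_only = {"NDS": 60.31, "mAP": 50.17, "mAOE": 0.3598}  # Idx. 3: G_e only
dueqnet = {"NDS": 60.62, "mAP": 50.49, "mAOE": 0.3506}      # Idx. 4: L_e and G_e

def relative_reduction(before, after):
    """Relative reduction in percent, for metrics where lower is better."""
    return 100.0 * (before - after) / before

nds_gain = round(dueqnet["NDS"] - baseline["NDS"], 2)  # 1.07 absolute points
map_gain = round(dueqnet["mAP"] - baseline["mAP"], 2)  # 1.59 absolute points
aoe_drop_full = round(relative_reduction(baseline["mAOE"], dueqnet["mAOE"]), 2)  # 8.94 %

# Contribution of each module on its own, relative to the baseline mAOE.
aoe_drop_local = round(relative_reduction(baseline["mAOE"], local_only["mAOE"]), 2)
aoe_drop_global = round(relative_reduction(baseline["mAOE"], global_only["mAOE"]), 2)
```

Evaluated on the table values, the global module alone accounts for a much larger share of the mAOE reduction than the local module alone, which matches the observation in the text.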

Generalization: To investigate the generalization of our dual-equivariance structure, we conduct a further experiment. We simply replace the pillar encoding and feature extraction parts of popular 3D object detection methods with our dual-equivariance structure.

As shown in Table IV, we compare four methods, PointPillars, SSN, Free-anchor3d and CenterPoint, on the nuScenes validation set. With our dual-equivariance structure, all methods improve in both mAP and NDS, demonstrating the generalization of our dual-equivariance structure.

IV-D Visualization Analysis

Setting the visual range of the x-axis and y-axis to $[-40m,40m]$ , we visualize the detection results of DuEqNet and other detection methods in the bird’s-eye view, as presented in Figure 4. The blue bounding boxes represent the ground truth, and the green ones are the predicted boxes. The line in each box indicates the direction of the object. As Figure 4 shows, DuEqNet obtains better orientation predictions and effectively reduces missed and false detections. The visualization results demonstrate the validity of dual-equivariance feature extraction, which benefits orientation prediction and overall performance.

V Conclusion

In this paper, we present a dual-equivariance network for outdoor 3D object detection named DuEqNet, which employs a hierarchical embedded framework to extract equivariant features at the local and global levels. Through this efficient approach, the challenge brought by scene rotation in autonomous driving is effectively mitigated. We demonstrate that our dual-equivariance concept improves the accuracy of object detection. Moreover, our dual-equivariance structure generalizes to other methods and improves their performance. As autonomous driving continues to develop, the concept of dual-equivariance introduced by our network provides a fresh perspective on enhancing self-driving safety.

ACKNOWLEDGMENT

This work was partially supported by ’Fujian Science & Technology Innovation Laboratory for Optoelectronic Information of China’ (Grant 2021ZZ120), ’FuJian Science and Technology Plan’ (Grant 2021T3003) and ’QuanZhou Science and Technology Plan’ (Grant 2021C065L).

Bibliography

[1] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang, “Gs3d: An efficient 3d object detection framework for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1019–1028.
[2] J. Mao, M. Niu, C. Jiang, H. Liang, J. Chen, X. Liang, Y. Li, C. Ye, W. Zhang, Z. Li, et al., “One million scenes for autonomous driving: Once dataset,” arXiv preprint arXiv:2106.11037, 2021.
[3] R. Nabati and H. Qi, “Rrpn: Radar region proposal network for object detection in autonomous vehicles,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3093–3097.
[4] W. Zeng, S. Wang, R. Liao, Y. Chen, B. Yang, and R. Urtasun, “Dsdnet: Deep structured self-driving network,” in European Conference on Computer Vision. Springer, 2020, pp. 156–172.
[5] R. Cervero, “Mobility niches: jitneys to robo-taxis,” Journal of the American Planning Association, vol. 83, no. 4, pp. 404–412, 2017.
[6] D. Lee, G. Kang, B. Kim, and D. H. Shim, “Assistive delivery robot application for real-world postal services,” IEEE Access, vol. 9, pp. 141981–141998, 2021.
[7] J. Mao, S. Shi, X. Wang, and H. Li, “3d object detection for autonomous driving: A review and new outlooks,” arXiv preprint arXiv:2206.09474, 2022.
[8] M. Zeeshan Zia, M. Stark, and K. Schindler, “Are cars just 3d boxes?-jointly estimating the 3d shape of multiple objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3678–3685.