Surfel-based 3D Registration with Equivariant SE(3) Features

Xueyang Kang; Hang Zhao; Kourosh Khoshelham; Patrick Vandewalle

arXiv:2508.20789·cs.CV·August 29, 2025

TL;DR

This paper introduces a surfel-based 3D registration method that learns explicit SE(3) equivariant features, improving robustness to noise and rotations in point cloud alignment tasks.

Contribution

It proposes a novel surfel-based pose regression approach that incorporates SE(3) equivariant convolutional features for more accurate and robust point cloud registration.

Findings

01 Outperforms state-of-the-art methods on indoor datasets

02 Demonstrates robustness to noisy and rotated point clouds

03 Effective in both indoor and outdoor 3D reconstruction scenarios

Equations

$$\epsilon_{i} = \gamma \cdot C \, \frac{e^{-\hat{\rho}}}{1 + e^{-\tan(\theta)}}$$

$$\theta = \arccos\left(\frac{\vec{r} \cdot \vec{o}}{\|\vec{r}\| \, \|\vec{o}\|}\right)$$

$$\hat{\rho} = \min(\max(\rho, \rho_{min}), \rho_{max})$$

$$L_{Huber} = \begin{cases} \frac{1}{2} \|\hat{\mathbf{x}}_{i^{*}} - \mathbf{y}_{j^{*}}\|^{2}, & \text{if } \|\hat{\mathbf{x}}_{i^{*}} - \mathbf{y}_{j^{*}}\| \leq \delta \\ \delta \left( \|\hat{\mathbf{x}}_{i^{*}} - \mathbf{y}_{j^{*}}\| - \frac{1}{2} \delta \right), & \text{otherwise} \end{cases}$$

Taxonomy

Topics: Image Processing and 3D Reconstruction · 3D Shape Modeling and Analysis · Medical Image Segmentation Techniques

Full text

SURFEL-BASED 3D REGISTRATION WITH EQUIVARIANT SE(3) FEATURES

Thanks to the Graduate Research Scholarships of the University of Melbourne for sponsoring Xueyang's PhD study at Melbourne.

Xueyang Kang

0000-0001-7159-676X, University of Melbourne, KU Leuven

Parkville VIC 3052, Australia

[email protected]

Hang Zhao

https://orcid.org/0000-0002-5279-0273, University of Melbourne

Parkville VIC 3052, Australia

[email protected]

Kourosh Khoshelham

0000-0001-6639-1727, University of Melbourne

Parkville VIC 3052, Australia

[email protected]

Patrick Vandewalle

0000-0002-7106-8024, KU Leuven

3000 Leuven, Belgium

[email protected]

Abstract

Point cloud registration is crucial for ensuring consistent 3D alignment of multiple local point clouds in 3D reconstruction for remote sensing or digital heritage. Although various point cloud registration methods exist, both non-learning and learning-based, they ignore point orientations and point uncertainties. This makes them susceptible to noisy input and to aggressive rotations of the input point cloud, such as orthogonal transformations, so they require extensive training point clouds with transformation augmentations. To address these issues, we propose a novel surfel-based pose regression approach. Our method initializes surfels from a LiDAR point cloud using virtual perspective camera parameters, and learns explicit $\mathbf{SE(3)}$ equivariant features, covering both position and rotation, through $\mathbf{SE(3)}$ equivariant convolutional kernels to predict the relative transformation between source and target scans. The model comprises an equivariant convolutional encoder, a cross-attention mechanism for similarity computation, a fully-connected decoder, and a non-linear Huber loss. Experimental results on indoor and outdoor datasets demonstrate our model's superiority and robust performance on real point-cloud scans compared to state-of-the-art methods.

Index Terms:

Surfel, Equivariant CNN Kernel, E2PN, $\mathbf{SO(3)}$, $\mathbf{SE(3)}$, Point Cloud Registration, Huber Loss.

I Introduction

Point cloud registration is crucial in 3D reconstruction, shape pose estimation, digital twins in remote sensing, and various other applications. It aims to estimate the relative transformation between source and target 3D point cloud scan pairs. These point clouds are usually generated by 3D scanning sensors such as LiDAR, structured light, RGB-D, or stereo cameras [1, 2].

Registration algorithms are generally categorized as rigid or non-rigid. Rigid methods, such as Iterative Closest Point (ICP) and its variants [3], estimate transformations through point-wise error optimization. Non-rigid approaches consider potential deformations or non-linear distortions, catering to scenarios with deformable objects or dynamic environments [4, 5]. Traditional methods rely on non-learning-based iterative optimization such as ICP [6], which is often slow to converge and degrades under high outlier-to-inlier ratios. In contrast, deep learning-based techniques map raw input points into high-dimensional feature spaces to establish correspondences directly, as demonstrated by DGR [7] and PointDSC [8]. SpinNet [9] and RoReg [10] are either $\mathbf{SO(2)}$ or $\mathbf{SO(3)}$ equivariant and learn rotation and position jointly from the point cloud. However, they may fail to predict the registration pose given noisy points or challenging sensor motion, e.g., in-image-plane rotation or camera motion parallel to the image plane.

To address these challenges, we employ a surfel representation: small, oriented disks un-projected from a depth map aligned with the view image, or derived from a LiDAR point cloud. Dahl et al. [11] first demonstrated the use of surfels in large-scale 3D reconstruction, and Behley and Stachniss [12] applied them to self-driving scenarios. Pfister et al. [13] further leveraged surfels for high-fidelity mesh surfaces in 3D graphical rendering. Surfels can be considered as 2D Gaussians. Compared to point clouds, the 2D surfel-based Gaussian approach exhibits superior robustness by leveraging data uncertainties, as evidenced by the recent Gaussian Splatting technique [14]. Furthermore, we introduce a specialized equivariant network model that learns $\mathbf{SE(3)}$ equivariant features, covering rotation and translation, from surfel input through E2PN [15]. Pairwise equivariant features undergo cross-attention to create an attention-based similarity feature map for correspondence establishment. These features are then processed through fully-connected layers to estimate the relative transformation for alignment. Lastly, we introduce a dedicated differentiable $\mathbf{SE(3)}$ Huber loss function for surfel-based registration.

II Related Work

Despite significant progress in point cloud representations for 3D reconstruction [16], existing methods often lack built-in equivariance, limiting their robustness in 3D registration tasks. While prior works on 3D registration have explored both traditional optimization-based approaches and deep feature learning, they do not explicitly incorporate rotational equivariance at the architectural level.

Applications of 3D Surfels. SurfelMeshing [17] has been applied to 3D mapping in indoor scenes, while other approaches have applied surfels to large-scale outdoor mapping [18].

3D Registration. Registration techniques are widely used in shape alignment [7, 19], remote sensing [20, 21] and deformable target scanning [22]. Traditional non-learning methods like Iterative Closest Point (ICP), KISS-ICP [6], and point-to-plane ICP [23] struggle with nonlinear optimization errors. In contrast, state-of-the-art deep learning models like PointDSC [8] and Deep Global Registration (DGR) [7] search for correspondences in high-dimensional feature spaces or with Superpoint descriptors [24]. Max-Clique [25], GeoTransformer [26], and Equi-GSPR [27] leverage recent graph or attention learning to build their network backbones.

Equivariant Feature Representation. Equivariant models have emerged as robust solutions for 3D applications such as point cloud registration and pose estimation. These models maintain $\mathbf{SE(3)}$ rotational and translational equivariance, which is essential for consistent alignment performance. Examples include Spherical CNNs [28], group CNNs [29], and SE(3)-Transformers [30].

SpinNet [9] and RoReg [10] leverage rotation-equivariant features for point cloud registration.

III Method

For perspective image input, we initialize surfels from the unprojected depth map and its associated normal map. For LiDAR, we create a virtual camera plane in front of the sensor with a fixed Field of View (FoV); the surfel uncertainties are then calculated from a virtual perspective projection model of the sensor, accounting for both the inverse-distance uncertainty and the angle between the LiDAR ray direction and the sensor principal axis. The whole initialization process serves as pre-processing. For a LiDAR point cloud, surfels can be created from the neighboring non-coplanar triplet points for normal vector estimation, and a 1D uncertainty can be derived proportionally from the point density. To ensure registration efficiency, surfels from the source and target frames are downsampled before being fed into the neural network model. The model architecture, as illustrated in fig. 1, consists of three components: an $\mathbf{SE(3)}$ equivariant convolution kernel-based encoder, a cross-attention module, and a decoder for predicting the relative transformation.

III-A Surfel Initialization

Each surfel in the source frame, indexed by $i$, consists of three main components: the 3D position $\mathbf{x}_{i}\in\mathbb{R}^{3}$, the normal vector $\mathbf{n}_{i}\in\mathbb{R}^{3}$, and a scalar radius $\epsilon_{i}\in\mathbb{R}$. Surfels in the target frame are indexed by $j$, with position $\mathbf{y}_{j}$. The surfel center position is given by the point cloud position. To calculate the normal map from the point cloud, we employ a parallel CUDA-based PCA plane normal initialization over the K nearest neighboring points. The normal determines the orientation of the surfel disk. The surfel radius $\epsilon_{i}$ is then given by the expression:

$$\epsilon_{i} = \gamma \cdot C \, \frac{e^{-\hat{\rho}}}{1 + e^{-\tan(\theta)}},$$

where $C$ is the normalization factor, $\gamma$ is the point intensity normalized to a probability, and $\theta$ is the view angle between the virtual camera principal axis $\vec{o}$ and the ray $\vec{r}$ emitted from the sensor center through the pixel location, as expressed in the following equation,

$$\theta = \arccos\left(\frac{\vec{r} \cdot \vec{o}}{\|\vec{r}\| \, \|\vec{o}\|}\right).$$

The value of $\rho$ represents the inverse of the depth, which is then truncated to $\hat{\rho}$ within the inverse depth range $(\rho_{min},\rho_{max})$ of the sensor,

$$\hat{\rho} = \min(\max(\rho, \rho_{min}), \rho_{max}).$$

The colored point cloud position, the normals calculated from the neighboring points of interest points, and the initialized surfel uncertainties (radius) overlaid onto points are presented separately in fig. 2. This surfel radius heuristic initializes uncertainty with the virtual sensor perspective model, resulting in smaller radii for points near the image centre and larger radii for points further away. Additionally, the radius increases with depth distance, modelling depth measurement noise.
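The radius heuristic can be illustrated with a minimal Python sketch. It assumes the radius takes the form $\epsilon_{i} = \gamma \cdot C \, e^{-\hat{\rho}} / (1 + e^{-\tan(\theta)})$ as listed among the paper's equations, that depth is measured along a unit-length principal axis, and that the point lies in front of the virtual camera. The function names, the default axis, and the values of `C`, `rho_min` and `rho_max` are hypothetical and not taken from the paper.

```python
import math


def view_angle(ray, axis):
    """Angle theta (radians) between a sensor ray r and the principal axis o."""
    dot = sum(r * o for r, o in zip(ray, axis))
    norm_r = math.sqrt(sum(r * r for r in ray))
    norm_o = math.sqrt(sum(o * o for o in axis))
    # Clamp the cosine to [-1, 1] so rounding error cannot break acos.
    return math.acos(max(-1.0, min(1.0, dot / (norm_r * norm_o))))


def surfel_radius(point, gamma, axis=(0.0, 0.0, 1.0), C=1.0,
                  rho_min=0.01, rho_max=2.0):
    """Radius (uncertainty) of a surfel centred at `point` in the sensor frame.

    gamma is the intensity normalized to a probability; C, rho_min and
    rho_max are illustrative placeholder values, not the paper's settings.
    """
    depth = sum(p * o for p, o in zip(point, axis))  # assumes a unit-length axis
    if depth <= 0.0:
        raise ValueError("point must lie in front of the virtual camera")
    rho_hat = min(max(1.0 / depth, rho_min), rho_max)  # clamped inverse depth
    theta = view_angle(point, axis)
    return gamma * C * math.exp(-rho_hat) / (1.0 + math.exp(-math.tan(theta)))


if __name__ == "__main__":
    centre_near = surfel_radius((0.0, 0.0, 2.0), gamma=1.0)
    off_axis = surfel_radius((1.0, 0.0, 2.0), gamma=1.0)
    centre_far = surfel_radius((0.0, 0.0, 10.0), gamma=1.0)
    print(centre_near, off_axis, centre_far)
```

Under these assumptions the sketch reproduces the two trends described above: at equal depth the off-axis point receives a larger radius than the on-axis one, and on the axis the farther point receives a larger radius than the nearer one.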

III-B Network Structure

Given the initialized surfels of the source frame $\mathbf{s}_{i}\in\{\mathbf{s}_{1},...,\mathbf{s}_{N}\}$ and the target frame $\mathbf{s}_{j}\in\{\mathbf{s}_{1},...,\mathbf{s}_{N}\}$, all surfels are encoded by the same encoder $f_{\theta}(\cdot)$. Notably, the position and normal vectors $\mathbf{p}_{(\cdot)}$, $\mathbf{n}_{(\cdot)}$ of each surfel are weighted by a factor of $(1-\|\epsilon_{(\cdot)}\|)$ to reduce the influence of highly uncertain surfels. The encoder architecture builds on E2PN [15], with doubled feature dimensions compared to the original point cloud input. Equivariance is maintained through a symmetric convolution kernel $\kappa$ arranged in the shape of an icosahedron.

$\mathbf{SE}(3)$, the Special Euclidean Group, is the group of rigid body transformations in 3D space, comprising rotations and translations. An element of $\mathbf{SE}(3)$ is represented as a $3\times 4$ matrix $\mathbf{T}=[\mathbf{R},\mathbf{t}]$, where $\mathbf{R}\in\mathbf{SO}(3)$ is a $3\times 3$ rotation matrix and $\mathbf{t}\in\mathbb{R}^{3}$ is a translation vector. $\mathbf{SO}(3)$, the Special Orthogonal Group in 3D, is the group of all 3D rotations.

In the following, a prime ′ next to a symbol indicates discretization. E2PN encoder features are aligned in the space $\mathbf{S}^{2^{\prime}}\times\mathbb{R}^{3}$, where $\mathbb{R}^{3}$ holds the coordinates of each feature vertex, each associated with a 128-dimensional feature descriptor, and $\mathbf{S}^{2^{\prime}}$ denotes the discretized sphere surface. This discretized feature representation is determined by the quotient $\mathbf{SO}(3)^{\prime}/\mathbf{SO}(2)^{\prime}$, where $\mathbf{SO}(2)^{\prime}$ is a subgroup of $\mathbf{SO}(3)$. Each element of the quotient space is a set of rotations $\mathbf{R}_{i/j}$ that share the same endpoint, such as the position of the sphere's north pole after rotation. This discretization of $\mathbf{SO}(3)$ makes learning more efficient. The $\mathbf{SE}(3)$ feature is constructed by extending the $\mathbf{SO}(3)$ rotation feature, incorporating translation through the concatenation of point coordinates.
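As a concrete illustration of the definitions in this subsection, the following Python sketch applies an $\mathbf{SE}(3)$ element $\mathbf{T}=[\mathbf{R},\mathbf{t}]$ to a single surfel and applies the uncertainty weighting $(1-\|\epsilon\|)$. It rests on two assumptions that the text does not spell out: surfel normals are treated as directions, so only $\mathbf{R}$ acts on them, and the radius is normalized to $[0, 1]$ so the weight stays non-negative. All function names are hypothetical.

```python
import math


def rot_z(angle):
    """3x3 rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def mat_vec(R, v):
    """Product of a 3x3 matrix (list of rows) with a 3-vector."""
    return [sum(R[r][c] * v[c] for c in range(3)) for r in range(3)]


def apply_se3(R, t, position, normal):
    """Act with T = [R, t] on one surfel.

    The position maps to R x + t; the normal is a direction, so only the
    rotation is applied (an assumption of this sketch).
    """
    new_position = [a + b for a, b in zip(mat_vec(R, position), t)]
    new_normal = mat_vec(R, normal)
    return new_position, new_normal


def weight_surfel(position, normal, radius):
    """Scale position and normal by (1 - |radius|) to damp uncertain surfels."""
    w = 1.0 - abs(radius)
    return [w * p for p in position], [w * n for n in normal]


if __name__ == "__main__":
    R, t = rot_z(math.pi / 2.0), [1.0, 0.0, 0.0]
    print(apply_se3(R, t, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
    print(weight_surfel([2.0, 0.0, 0.0], [0.0, 0.0, 1.0], 0.25))
```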

Each surfel is convolved with two distinct symmetric kernels, $\kappa_{1}$ and $\kappa_{2}$, as shown in fig. 3, relating to point position and normal respectively. The two outputs are concatenated after convolution. The icosahedral rotation group comprises 60 rotations, each corresponding to a permutation order $\mathbf{R}_{(\cdot)}$. The E2PN symmetric kernel selects one rotation from the 60 options to generate the output equivariant feature, choosing the rotation whose feature has the maximum sum along the 12-channel dimension.

The transformation applied to the input point cloud is denoted by the group rotation $g^{\prime}(\cdot)$. The key idea of equivariant feature learning is that the output features of the encoder transform accordingly, preserving $\mathbf{SE(3)}$ equivariance: $f_{\theta}(g^{\prime}(x))=g^{\prime}(f_{\theta}(x))$. The equivariant features produced by the E2PN encoder for the source and target frames are denoted as $\mathbf{D}_{i},i\in\{1,...,12\}$ and $\mathbf{D}_{j},j\in\{1,...,12\}$, each with 128 dimensions. Next, a linear layer projects each descriptor into a triplet of $\mathbf{Q}_{(\cdot)},\mathbf{K}_{(\cdot)},\mathbf{V}_{(\cdot)}$ tokens. We use the same index convention throughout the paper: $i$ indexes the source frame and $j$ the target frame. The triplet tokens, derived from feature descriptors at the 12 vertices of the icosahedron (see fig. 3), are aggregated from neighboring surfel coordinates. This effectively fuses the input surfels into 12 distinct regions, which are then used in the subsequent cross-attention module. The cross-attention $g_{\theta}(\cdot)$ is then applied to compute the attention-weighted equivariant features (a $12\times 12$ attention map) from the frame pair, which are finally decoded by the fully connected layers into the transformation estimate.
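The cross-attention step can be sketched in plain Python as follows. This is a generic single-head scaled dot-product cross-attention, with queries from the 12 source descriptors and keys and values from the 12 target descriptors; the paper does not state the scaling, the number of heads, or the token width, so those choices are assumptions here. The random matrices only stand in for E2PN features and learned projection weights, and the token width of 32 is hypothetical.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def softmax_rows(S):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Random stand-in for encoder features or learned weights (illustration only)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]


def cross_attention(D_src, D_tgt, W_q, W_k, W_v):
    """Queries from source descriptors, keys and values from target descriptors."""
    Q = matmul(D_src, W_q)
    K = matmul(D_tgt, W_k)
    V = matmul(D_tgt, W_v)
    scale = math.sqrt(len(Q[0]))
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / scale for s in row] for row in matmul(Q, K_T)]
    attn = softmax_rows(scores)      # 12 x 12 source-to-target attention map
    return attn, matmul(attn, V)     # attention-weighted target features


if __name__ == "__main__":
    rng = random.Random(0)
    D_src = random_matrix(12, 128, rng)   # 12 anchors, 128-dim descriptors
    D_tgt = random_matrix(12, 128, rng)
    W_q, W_k, W_v = (random_matrix(128, 32, rng) for _ in range(3))
    attn, fused = cross_attention(D_src, D_tgt, W_q, W_k, W_v)
    print(len(attn), len(attn[0]), len(fused), len(fused[0]))  # 12 12 12 32
```

Each row of the attention map sums to one, so it can be read as a soft assignment of one source region to the 12 target regions.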

III-C Loss Function

Inspired by the node-wise supervision in PointDSC [8], we replace the original binary cross-entropy loss with a non-linear Huber loss. This maps the transformed point error to the $\mathcal{L}_{2}$ norm when the error is small and to the $\mathcal{L}_{1}$ norm when the error is large. The point position from the source frame is transformed by the predicted rotation $\mathbf{R}$ and translation $\mathbf{t}$ into $\mathbf{y}_{j^{*}}=\mathbf{R}\hat{\mathbf{x}}_{i^{*}}+\mathbf{t}$. The rotation matrix $\mathbf{R}$ is derived from the predicted quaternion $\mathbf{q}$. The Huber loss is defined as below,

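$$L_{Huber} = \begin{cases} \frac{1}{2} \|\hat{\mathbf{x}}_{i^{*}} - \mathbf{y}_{j^{*}}\|^{2}, & \text{if } \|\hat{\mathbf{x}}_{i^{*}} - \mathbf{y}_{j^{*}}\| \leq \delta \\ \delta \left( \|\hat{\mathbf{x}}_{i^{*}} - \mathbf{y}_{j^{*}}\| - \frac{1}{2} \delta \right), & \text{otherwise} \end{cases}$$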

The threshold $\delta$ is set to 0.6 m. The correspondence index pair $(i,j)$ is established using nearest-neighbor point search.
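A minimal Python sketch of this loss is given below. It assumes that the residual is the Euclidean distance between a source point after the predicted transform and its nearest target point, that the quaternion is ordered $(w, x, y, z)$, and that the per-point losses are averaged; the paper does not state the quaternion convention or the reduction, and the brute-force neighbor search is for illustration only. All function names are hypothetical.

```python
import math


def quat_to_rot(q):
    """Rotation matrix from a quaternion (w, x, y, z), normalized first."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def transform(R, t, p):
    """Apply the rigid transform R p + t to a 3D point."""
    return [sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3)]


def huber(residual, delta=0.6):
    """Quadratic for residuals up to delta, linear beyond it."""
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * (residual - 0.5 * delta)


def registration_huber_loss(source, target, q, t, delta=0.6):
    """Mean Huber loss between transformed source points and nearest targets."""
    R = quat_to_rot(q)
    total = 0.0
    for p in source:
        x_hat = transform(R, t, p)
        # Brute-force nearest-neighbor search; a k-d tree would be used at scale.
        residual = min(math.dist(x_hat, y) for y in target)
        total += huber(residual, delta)
    return total / len(source)


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
    identity = (1.0, 0.0, 0.0, 0.0)
    print(registration_huber_loss(cloud, cloud, identity, (0.0, 0.0, 0.0)))
    print(registration_huber_loss(cloud, cloud, identity, (0.1, 0.0, 0.0)))
```

The two branches agree at a residual equal to $\delta$, so the loss stays continuous while large residuals from wrong correspondences grow only linearly.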

IV Experiment Results

To evaluate the model performance, we use the outdoor KITTI dataset [31]. Our model was trained on each dataset separately for fair comparison against the baseline models. We employed 3D voxel-based downsampling to generate 2048 points, unprojected from the depth map of each frame, for surfel initialization.
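Voxel-based downsampling can be sketched as below: points are grouped by the voxel they fall into and each occupied voxel is replaced by the centroid of its points. This is a generic voxel-grid filter, not the authors' implementation. The paper does not say how the fixed budget of 2048 points is reached; tuning `voxel_size` or subsampling the result afterwards are possible ways, mentioned here only as assumptions.

```python
import math
from collections import defaultdict


def voxel_downsample(points, voxel_size):
    """Replace all points inside each occupied voxel by their centroid."""
    buckets = defaultdict(list)
    for p in points:
        # Integer voxel index along each axis; floor also handles negatives.
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    return [tuple(sum(axis) / len(group) for axis in zip(*group))
            for group in buckets.values()]


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.2, 0.0, 0.0)]
    print(voxel_downsample(cloud, voxel_size=1.0))
```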

Evaluation Details. We use the Rotation Error (RE) and Translation Error (TE) to evaluate the accuracy of rotation and translation separately. Furthermore, we report the Registration Recall (RR) and F1 score as registration success metrics. We compare our model with popular deep learning-based models, including PointDSC [8], Deep Global Registration (DGR) [7], and Maximal Clique (MAC) [25]. Additionally, we include equivariant methods such as SpinNet [9], RoReg [10] and GeoTransformer [26] for KITTI. RoReg does not provide pre-trained weights on KITTI, so we trained it from scratch; the resulting performance was poor, and we therefore do not report it. For all feature point descriptor-based approaches, such as DGR [7], the FPFH [32] descriptor is used for evaluation on KITTI [31]. We provide quantitative evaluation results in table I, and qualitative comparisons including the top three baseline models in fig. 4. These results show that our model outperforms the baselines.
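The paper does not write out the metric definitions, so the sketch below uses the forms commonly adopted in the registration literature: RE as the geodesic angle of $\mathbf{R}_{gt}^{\top}\mathbf{R}_{pred}$, TE as the Euclidean distance between translations, and RR as the fraction of scan pairs whose RE and TE both fall below given thresholds. The thresholds are left as required arguments because the values used in the paper are not stated in this section; the function names are hypothetical.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between two rotation matrices."""
    # trace(R_gt^T R_pred) equals the sum of element-wise products.
    trace = sum(R_gt[r][c] * R_pred[r][c] for r in range(3) for c in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return math.dist(t_pred, t_gt)


def registration_recall(errors, re_thresh_deg, te_thresh):
    """Fraction of (RE, TE) pairs with both errors below their thresholds."""
    ok = sum(1 for rot_err, trans_err in errors
             if rot_err < re_thresh_deg and trans_err < te_thresh)
    return ok / len(errors)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    c, s = math.cos(math.radians(30.0)), math.sin(math.radians(30.0))
    rot_30 = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    print(rotation_error_deg(rot_30, identity))
    print(translation_error((3.0, 4.0, 0.0), (0.0, 0.0, 0.0)))
```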

Ablation Study. 1) We perform eight types of ablation tests to verify the contribution of each design choice to model performance, as shown in table II. Feeding point cloud positions into various SOTA point cloud encoders, or into the vanilla E2PN encoder, does not achieve performance on par with our surfel-based equivariant model design. The uncertainties and the Huber loss both improve model performance. 2) We analyze robustness to input scans perturbed by rotations and translations of increasing magnitude in table III. 3) Finally, we compare model complexity for the top five baseline models and our model in table IV, showing the low latency and small model size of our model.

Our model has a total inference time (latency) of 0.090 seconds per source-target scan pair (2048 points). Additionally, we present a more detailed runtime and module size analysis of the entire model in table V. Most of the inference time is spent in the E2PN encoder.

V Conclusion

We present a surfel-based $\mathbf{SE(3)}$-equivariant network designed for robust 3D registration, leveraging surfel initialization from raw depth maps or LiDAR point clouds. Our approach integrates a shared E2PN encoder, a cross-attention module, and an MLP-based decoder. Experimental results on two datasets demonstrate the strong performance of the model, highlighting its potential for real-world applications in 3D reconstruction, mapping, and augmented reality. Future work could extend this framework to more complex scenarios, such as large-scale scene reconstruction and robust registration under extreme occlusions or sparse views. Exploring dynamic scene understanding is another promising direction.

Bibliography

[1] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison et al., “Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera,” in Proceedings of the 24th annual ACM symposium on User interface software and technology, 2011, pp. 559–568.
[2] J. Huai, Y. Zhang, and A. Yilmaz, “Real-time large scale 3d reconstruction by fusing kinect and imu data,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 2, pp. 491–496, 2015.
[3] A. Segal, D. Haehnel, and S. Thrun, “Generalized-icp,” in Robotics: science and systems, vol. 2, no. 4. Seattle, WA, 2009, p. 435.
[4] J. Stückler and S. Behnke, “Multi-resolution surfel maps for efficient dense 3d modeling and tracking,” Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 137–147, 2014.
[5] A. Myronenko and X. Song, “Point set registration: Coherent point drift,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 12, pp. 2262–2275, 2010.
[6] K.-L. Low, “Linear least-squares optimization for point-to-plane icp surface registration,” Chapel Hill, University of North Carolina, vol. 4, no. 10, pp. 1–3, 2004.
[7] C. Choy, W. Dong, and V. Koltun, “Deep global registration,” in CVPR, 2020.
[8] X. Bai, Z. Luo, L. Zhou, H. Chen, L. Li, Z. Hu, H. Fu, and C.-L. Tai, “Pointdsc: Robust point cloud registration using deep spatial consistency,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15859–15869.