TL;DR
Point-SRA introduces a novel 3D representation learning method that aligns multi-level features through self-distillation and probabilistic modeling, improving performance across various 3D tasks.
Contribution
It proposes a dual self-representation alignment mechanism at both the MAE and MeanFlow Transformer levels, addressing the limitations of fixed mask ratios and point-wise reconstruction.
Findings
Outperforms Point-MAE by 5.37% on ScanObjectNN.
Achieves 96.07% mean IoU on intracranial aneurysm segmentation.
Surpasses MaskPoint by 5.12% AP@50 in 3D object detection.
Abstract
Masked autoencoders (MAE) have become a dominant paradigm in 3D representation learning, setting new performance benchmarks across various downstream tasks. Existing methods with a fixed mask ratio neglect multi-level representational correlations and intrinsic geometric structures, while relying on point-wise reconstruction assumptions that conflict with the diversity of point clouds. To address these issues, we propose a 3D representation learning method, termed Point-SRA, which aligns representations through self-distillation and probabilistic modeling. Specifically, we assign different masking ratios to the MAE to capture complementary geometric and semantic information, while the MeanFlow Transformer (MFT) leverages cross-modal conditional embeddings to enable diverse probabilistic reconstruction. Our analysis further reveals that representations at different time steps in MFT also…
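To make the idea of assigning different masking ratios concrete, here is a minimal sketch of drawing several masks over the same set of point-cloud patches. It is an illustration under stated assumptions, not the paper's implementation: the ratios (0.4, 0.6, 0.8), the patch count of 64, the uniform random masking strategy, and the function names `random_mask` and `multi_ratio_masks` are all hypothetical and do not come from the abstract.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True marks a masked patch.

    Uniform random masking is an assumption made for this sketch.
    """
    num_masked = int(round(num_patches * mask_ratio))
    masked_idx = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_idx for i in range(num_patches)]


def multi_ratio_masks(num_patches, ratios, seed=0):
    """Draw one independent mask per ratio over the same patches."""
    rng = random.Random(seed)
    return {r: random_mask(num_patches, r, rng) for r in ratios}


# Illustrative values only; the abstract does not state the ratios used.
masks = multi_ratio_masks(num_patches=64, ratios=(0.4, 0.6, 0.8))
for ratio, mask in masks.items():
    print(f"ratio={ratio}: {sum(mask)} of {len(mask)} patches masked")
```

In a setup like this, each mask would give the encoder a different view of the same object: a low ratio leaves more local geometry visible, while a high ratio forces reliance on coarser context. How Point-SRA aligns the resulting representations is described in the paper itself and is not reproduced here.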