ModelNet40-E: An Uncertainty-Aware Benchmark for Point Cloud Classification
Pedro Alonso, Tianrui Li, Chongshou Li

TL;DR
ModelNet40-E is a new benchmark for evaluating point cloud classification models' robustness and uncertainty calibration under synthetic LiDAR-like noise, providing both noisy data and uncertainty annotations.
Contribution
It introduces ModelNet40-E, the first benchmark with noise and uncertainty annotations for assessing robustness and calibration of point cloud classifiers.
Findings
Point Transformer v3 shows superior calibration under noise.
All models' accuracy degrades with increased noise.
Uncertainty predictions correlate with measurement noise.
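The last finding can be illustrated with a minimal sketch. It assumes that a model's predicted uncertainty is summarized as the entropy of its softmax output and that each point cloud's measurement noise is summarized as its mean σ; both choices, the helper names, and the toy numbers are illustrative assumptions, not the paper's stated procedure.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one softmax output vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical toy data: mean per-cloud sigma and the classifier's softmax outputs.
mean_sigma = [0.01, 0.02, 0.05, 0.10]
softmax_outputs = [
    [0.97, 0.02, 0.01],
    [0.90, 0.07, 0.03],
    [0.70, 0.20, 0.10],
    [0.40, 0.35, 0.25],
]
entropies = [predictive_entropy(p) for p in softmax_outputs]
r = pearson(mean_sigma, entropies)
print(f"Pearson r between mean sigma and predictive entropy: {r:.3f}")
```

A value of r near 1 would mean the model grows less certain as the input gets noisier; a value near 0 would mean its uncertainty ignores the measurement noise.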
Abstract
We introduce ModelNet40-E, a new benchmark designed to assess the robustness and calibration of point cloud classification models under synthetic LiDAR-like noise. Unlike existing benchmarks, ModelNet40-E provides both noise-corrupted point clouds and point-wise uncertainty annotations via Gaussian noise parameters (σ, μ), enabling fine-grained evaluation of uncertainty modeling. We evaluate three popular models (PointNet, DGCNN, and Point Transformer v3) across multiple noise levels using classification accuracy, calibration metrics, and uncertainty-awareness. While all models degrade under increasing noise, Point Transformer v3 demonstrates superior calibration, with predicted uncertainties more closely aligned with the underlying measurement uncertainty.
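To make the idea of point-wise (σ, μ) annotations concrete, here is a rough sketch of corrupting a point cloud with per-point Gaussian noise while recording the parameters used for each point. The function names, the isotropic per-coordinate noise, and the linear range-dependent σ are assumptions made for illustration; the benchmark's actual noise model and parameter values are not reproduced here.

```python
import random


def range_dependent_sigma(point, base=0.005, slope=0.01):
    """Hypothetical noise model: sigma grows linearly with distance from the origin."""
    distance = sum(c * c for c in point) ** 0.5
    return base + slope * distance


def corrupt_point_cloud(points, sigma_fn, mu=0.0, seed=0):
    """Add Gaussian noise to every point and return (noisy_points, annotations).

    annotations[i] is the (sigma, mu) pair used for points[i], so the
    ground-truth measurement uncertainty is stored alongside the noisy data.
    """
    rng = random.Random(seed)
    noisy, annotations = [], []
    for point in points:
        sigma = sigma_fn(point)
        # Assumption: the same (mu, sigma) is applied independently to x, y, z.
        noisy.append(tuple(c + rng.gauss(mu, sigma) for c in point))
        annotations.append((sigma, mu))
    return noisy, annotations


# Toy usage with three points at increasing range.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
noisy_cloud, sigma_mu = corrupt_point_cloud(cloud, range_dependent_sigma)
for original, noisy, (sigma, mu) in zip(cloud, noisy_cloud, sigma_mu):
    print(original, "->", noisy, f"sigma={sigma:.3f}, mu={mu:.1f}")
```

Storing the annotations next to the noisy points is what allows a later comparison between a model's predicted uncertainty and the noise that was actually injected.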
Peer Reviews
Decision·Submitted to ICLR 2026
1. The presentation of this paper is clear and easy to follow.
2. Multi-metric reporting: Accuracy, ECE, AUROC for error detection, and Pearson correlation between σ (true measurement uncertainty) and predicted uncertainty, reported separately on correct-only vs. all samples to isolate "awareness" from trivial confidence collapse.
3. Training-dynamics analysis: Longer training can improve accuracy but harm calibration and uncertainty awareness, reinforcing why deployment metrics must go beyond accuracy
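For readers unfamiliar with ECE (expected calibration error), the following is a minimal sketch of the standard equal-width binning estimator. The choice of 10 bins and the function name are assumptions; the review does not say which binning scheme the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: top-class softmax probabilities in [0, 1]
    correct:     booleans, True where the prediction matched the label
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, hit in members if hit) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy usage: an overconfident model that is 90% sure but right half the time.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
print(f"ECE = {ece:.2f}")
```

Lower is better: an ECE of 0 means stated confidence matches observed accuracy in every bin, which is why the metric can worsen even while accuracy improves.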
1. Benchmark is tied to ModelNet40 geometry and categories; real LiDAR scenes (e.g., autonomous driving) differ in distribution and occlusion patterns, limiting external validity without cross-dataset evidence. There is no visualization either.
2. The LiDAR model is parametric (linear range noise, cosine angle term, uniform outlier sampling). It’s plausible but may not capture sensor-specific phenomena (e.g., intensity-dependent dropout, multi-return behavior) beyond the chosen parameters in Ta
(+) The authors point out that point-cloud benchmarks typically assess only accuracy and ignore uncertainty, a key limitation for safety-critical robotics and autonomous driving applications. The proposed benchmark directly addresses this issue.
(+) Multiple canonical and modern architectures are trained and tested under unified settings, and the analysis spans calibration curves, AUROC, and correlation metrics, offering some empirical insights.
(-) While the benchmark is thoughtfully designed, its core contributions (benchmark extension and noise simulation) are largely incremental with respect to prior corruption studies such as ModelNet-C or PointCloud-C. The added uncertainty annotations and LiDAR noise formulation, though practical, do not represent a fundamentally new methodological direction.
(-) Although the manuscript aims at realism, all data remain synthetic (based on CAD models). Without real sensor validation or comparison
- The focus on calibration and uncertainty awareness in addition to accuracy is important for real-world safety-critical applications, where overconfident misclassifications can be harmful. The LiDAR-inspired corruption (range-dependent noise, incidence-angle effects, bias, and outliers) goes beyond simple jittering or dropout, making this benchmark more physically grounded.
- Results reveal trade-offs between clean accuracy and robustness. For instance, SimpleView, despite weaker clean accurac
- While ModelNet40-E extends a classic dataset, it is still synthetic, small-scale data. Conclusions may not fully carry over to large-scale real-world LiDAR datasets (e.g., nuScenes, Waymo, SemanticKITTI). Additionally, there are several closely related works that were not discussed: Uncertainty Estimation and Out-of-Distribution Detection for LiDAR Scene Semantic Segmentation (ECCV 2024), Calib3D: Calibrating Model Preferences for Reliable 3D Scene Understanding (WACV 2025), and MSC-Bench: Ben
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Remote Sensing and LiDAR Applications · Image Processing and 3D Reconstruction
