ModelNet40-E: An Uncertainty-Aware Benchmark for Point Cloud Classification

Pedro Alonso; Tianrui Li; Chongshou Li

arXiv:2508.01269·cs.CV·September 30, 2025

Open Access · 3 Reviews

TL;DR

ModelNet40-E is a new benchmark for evaluating point cloud classification models' robustness and uncertainty calibration under synthetic LiDAR-like noise, providing both noisy data and uncertainty annotations.

Contribution

It introduces ModelNet40-E, the first benchmark with noise and uncertainty annotations for assessing robustness and calibration of point cloud classifiers.

Findings

01. Point Transformer v3 shows superior calibration under noise.

02. All models' accuracy degrades with increased noise.

03. Uncertainty predictions correlate with measurement noise.

Abstract

We introduce ModelNet40-E, a new benchmark designed to assess the robustness and calibration of point cloud classification models under synthetic LiDAR-like noise. Unlike existing benchmarks, ModelNet40-E provides both noise-corrupted point clouds and point-wise uncertainty annotations via Gaussian noise parameters (σ, μ), enabling fine-grained evaluation of uncertainty modeling. We evaluate three popular models (PointNet, DGCNN, and Point Transformer v3) across multiple noise levels using classification accuracy, calibration metrics, and uncertainty-awareness. While all models degrade under increasing noise, Point Transformer v3 demonstrates superior calibration, with predicted uncertainties more closely aligned with the underlying measurement uncertainty.
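To make the idea of point-wise uncertainty annotations concrete, here is a minimal sketch of corrupting a point cloud with per-point Gaussian noise while keeping the noise parameters (σ, μ) alongside the corrupted points. The function name `corrupt_point_cloud`, the parameters `sigma_scale` and `bias`, and the choice to let σ grow linearly with distance from the origin are assumptions made for illustration; they are not the benchmark's actual noise model or parameter values.

```python
import numpy as np

def corrupt_point_cloud(points, sigma_scale=0.01, bias=0.0, seed=0):
    """Add per-point Gaussian noise and return the noise parameters as annotations.

    Assumption: sigma grows linearly with each point's distance from the origin.
    This is an illustrative stand-in, not the paper's LiDAR model.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=np.float64)       # shape (N, 3)
    ranges = np.linalg.norm(points, axis=1)              # distance of each point, shape (N,)
    sigma = sigma_scale * (1.0 + ranges)                 # per-point standard deviation
    mu = np.full_like(sigma, bias)                       # per-point mean offset
    # Broadcast the per-point parameters over the three coordinates.
    noise = rng.normal(loc=mu[:, None], scale=sigma[:, None], size=points.shape)
    return points + noise, sigma, mu
```

A caller would keep `sigma` and `mu` next to the noisy cloud so that a model's predicted uncertainty can later be compared against them.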

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The presentation of this paper is clear and easy to follow.
2. Multi-metric reporting: Accuracy, ECE, AUROC for error detection, and Pearson correlation between σ (true measurement uncertainty) and predicted uncertainty, reported separately on correct-only vs. all samples to isolate “awareness” from trivial confidence collapse.
3. Training-dynamics analysis: Longer training can improve accuracy but harm calibration and uncertainty awareness, reinforcing why deployment metrics must go beyond accuracy
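Two of the metrics named above can be sketched in a few lines. The block below computes a standard equal-width-bin expected calibration error (ECE) and the Pearson correlation between σ and predicted uncertainty on all samples and on correct-only samples. The function names, the 10-bin default, and the use of one scalar σ per sample (for example the mean of the point-wise values) are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins: weighted mean of |accuracy - confidence|."""
    confidences = np.asarray(confidences, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0 .. n_bins - 1; confidence 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap                     # weight by fraction of samples in bin
    return float(ece)

def sigma_uncertainty_correlation(sigma, predicted_uncertainty, correct):
    """Pearson r between sigma and predicted uncertainty: (all samples, correct-only)."""
    sigma = np.asarray(sigma, dtype=np.float64)
    u = np.asarray(predicted_uncertainty, dtype=np.float64)
    correct = np.asarray(correct, dtype=bool)
    r_all = np.corrcoef(sigma, u)[0, 1]
    r_correct = np.corrcoef(sigma[correct], u[correct])[0, 1]
    return float(r_all), float(r_correct)
```

Reporting the correct-only correlation next to the all-samples one helps separate genuine sensitivity to σ from the trivial effect of misclassified samples having low confidence.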

Weaknesses

1. Benchmark is tied to ModelNet40 geometry and categories; real LiDAR scenes (e.g., autonomous driving) differ in distribution and occlusion patterns, limiting external validity without cross-dataset evidence. There is no visualization either.
2. The LiDAR model is parametric (linear range noise, cosine angle term, uniform outlier sampling). It’s plausible but may not capture sensor-specific phenomena (e.g., intensity-dependent dropout, multi-return behavior) beyond the chosen parameters in Ta

Reviewer 02 · Rating 2 · Confidence 4

Strengths

(+) The authors point out that point-cloud benchmarks typically assess only accuracy and ignore uncertainty, a key limitation for safety-critical robotics and autonomous driving applications. The proposed benchmark directly addresses this issue.
(+) Multiple canonical and modern architectures are trained and tested under unified settings, and the analysis spans calibration curves, AUROC, and correlation metrics, offering some empirical insights.
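For readers unfamiliar with AUROC as an error-detection metric, the following sketch scores how well an uncertainty value separates misclassified samples from correctly classified ones. It uses the pairwise (Mann-Whitney) formulation, with ties counted as one half. The function name `error_detection_auroc` and the choice of uncertainty score (for example predictive entropy) are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def error_detection_auroc(uncertainty, correct):
    """Probability that a random misclassified sample has higher uncertainty
    than a random correctly classified one (ties count as 0.5)."""
    u = np.asarray(uncertainty, dtype=np.float64)
    is_error = ~np.asarray(correct, dtype=bool)
    pos, neg = u[is_error], u[~is_error]                 # errors are the positive class
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUROC needs at least one error and one correct prediction")
    # Compare every error against every correct sample; memory is O(n_pos * n_neg).
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```

A value of 0.5 means the uncertainty carries no information about errors, while 1.0 means every error is ranked above every correct prediction.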

Weaknesses

(-) While the benchmark is thoughtfully designed, its core contributions (benchmark extension and noise simulation) are largely incremental with respect to prior corruption studies such as ModelNet-C or PointCloud-C. The added uncertainty annotations and LiDAR noise formulation, though practical, do not represent a fundamentally new methodological direction.
(-) Although the manuscript aims at realism, all data remain synthetic (based on CAD models). Without real sensor validation or comparison

Reviewer 03 · Rating 2 · Confidence 5

Strengths

- The focus on calibration and uncertainty awareness in addition to accuracy is important for real-world safety-critical applications, where overconfident misclassifications can be harmful. The LiDAR-inspired corruption (range-dependent noise, incidence-angle effects, bias, and outliers) goes beyond simple jittering or dropout, making this benchmark more physically grounded.
- Results reveal trade-offs between clean accuracy and robustness. For instance, SimpleView, despite weaker clean accurac

Weaknesses

- While ModelNet40-E extends a classic dataset, it is still synthetic, small-scale data. Conclusions may not fully carry over to large-scale real-world LiDAR datasets (e.g., nuScenes, Waymo, SemanticKITTI). Additionally, there are several closely related works that were not discussed: Uncertainty Estimation and Out-of-Distribution Detection for LiDAR Scene Semantic Segmentation (ECCV 2024), Calib3D: Calibrating Model Preferences for Reliable 3D Scene Understanding (WACV 2025), and MSC-Bench: Ben

Taxonomy

Topics: 3D Surveying and Cultural Heritage · Remote Sensing and LiDAR Applications · Image Processing and 3D Reconstruction