Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate
Soumava Paul, Prakhar Kaushik, Alan Yuille

TL;DR
This paper evaluates how reliable multiview 3D consistency metrics are. It introduces a controlled robustness benchmark, shows that the reconstruction backbones behind many metrics can hallucinate geometry so that noisy or inconsistent inputs still score well, and proposes more robust, failure-aware metrics.
Contribution
It introduces a controlled robustness benchmark for multiview 3D consistency and a parametric family of neural metrics that are more robust and failure-aware.
Findings
The backbones behind existing metrics can hallucinate dense geometry and report consistency for images from unrelated scenes.
The proposed COLMAP-based metrics correlate better with human judgments.
The new metrics are up to 4 times more aligned with human perception.
Abstract
Multiview 3D evaluation assumes that the images being scored are observations of one static 3D scene. This assumption can fail in novel view synthesis (NVS) and sparse-view reconstruction: inputs or generated outputs may contain artifacts, outlier frames, repeated views, or noise, yet still receive high 3D consistency scores. Existing reference-based metrics require ground truth, while ground-truth-free metrics such as MEt3R depend on learned reconstruction backbones whose failure modes are poorly characterized. We study this reliability problem by comparing neural reconstruction priors with classical geometric verification. We introduce a controlled robustness benchmark for multiview 3D consistency and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields more robust variants. Our analysis…
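The backbone, residual, and aggregation decomposition mentioned in the abstract can be pictured with a minimal sketch. Everything named here (`ConsistencyMetric`, `identity_backbone`, `cosine_residual`, and the NaN return for the no-valid-pair case) is a hypothetical illustration and not the authors' implementation. In particular, the stand-in backbone assumes the views are already pixel-aligned, where a real metric would use a learned reconstruction model to warp one view into the other.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Sequence, Tuple

import numpy as np

Array = np.ndarray


def identity_backbone(a: Array, b: Array) -> Tuple[Array, Array, Array]:
    """Hypothetical stand-in: treats the two views as already aligned.

    Returns (features of a, features of b in a's frame, validity mask).
    """
    mask = np.ones(a.shape[:-1], dtype=bool)
    return a, b, mask


def cosine_residual(fa: Array, fb: Array, eps: float = 1e-8) -> Array:
    """Per-pixel cosine distance between feature maps of shape (H, W, C)."""
    num = (fa * fb).sum(axis=-1)
    den = np.linalg.norm(fa, axis=-1) * np.linalg.norm(fb, axis=-1) + eps
    return 1.0 - num / den


@dataclass
class ConsistencyMetric:
    backbone: Callable[[Array, Array], Tuple[Array, Array, Array]]
    residual: Callable[[Array, Array], Array]
    aggregate: Callable[[Sequence[float]], float]

    def score(self, views: Sequence[Array]) -> float:
        """Lower is more consistent; NaN signals that no pair could be scored."""
        pair_scores = []
        for i, j in combinations(range(len(views)), 2):
            fa, fb, mask = self.backbone(views[i], views[j])
            if mask.any():
                pair_scores.append(float(self.residual(fa, fb)[mask].mean()))
        if not pair_scores:
            # Assumption: report failure explicitly instead of a default score.
            return float("nan")
        return float(self.aggregate(pair_scores))


metric = ConsistencyMetric(identity_backbone, cosine_residual, np.median)
rng = np.random.default_rng(0)
view = rng.uniform(0.1, 1.0, size=(4, 4, 3))
same_scene_score = metric.score([view, view, view])
flipped_score = metric.score([view, -view])
```

Swapping any one of the three callables (for example `np.mean` for `np.median`) gives a different member of this toy family, which is the sense in which such a decomposition lets components be compared in isolation.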
