Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis
Valentina Lilova, Toyesh Chakravorty, Julian I. Bibo, Emma Boccaletti, Brandon Li, Lívia Baxová, Cees G. M. Snoek, Mohammadreza Salehi

TL;DR
This paper introduces a new benchmark for evaluating the intrinsic 3D understanding of foundation models without fine-tuning, extending existing 2D scene understanding frameworks to 3D multi-view scenarios.
Contribution
It presents a novel in-context 3D scene understanding benchmark based on the Hummingbird framework and 3D Multi-View ImageNet, enabling direct assessment of dense visual features.
Findings
DINO-based encoders perform well across large viewpoint shifts
Benchmarking 7 state-of-the-art models reveals varying 3D understanding capabilities
The benchmark provides a fine-grained evaluation across different difficulty levels
Abstract
Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream fine-tuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pre-trained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no fine-tuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images depicting objects at specific camera angles (keys), we benchmark the performance of segmenting novel views (queries) and report the scores in 4 categories of easy, medium, hard, and extreme based on the key-query…
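The abstract describes segmenting query views from labelled key views without any fine-tuning. A minimal sketch of one way such in-context segmentation can work is shown below, assuming nearest-neighbour label retrieval over frozen patch features. The function name `knn_label_propagation`, the parameters `k` and `temperature`, and the toy 2-D features are illustrative assumptions, not the benchmark's actual implementation or settings.

```python
import numpy as np

def knn_label_propagation(key_feats, key_labels, query_feats,
                          num_classes, k=2, temperature=0.1):
    """Assign a class to each query patch from its k most similar key patches.

    key_feats:   (K, D) patch features from the labelled key views
    key_labels:  (K,)   integer class label per key patch
    query_feats: (Q, D) patch features from the unlabelled query view
    Hypothetical helper for illustration; requires k <= K.
    """
    # Cosine similarity between every query patch and every key patch.
    key_n = key_feats / np.linalg.norm(key_feats, axis=1, keepdims=True)
    query_n = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = query_n @ key_n.T                                  # (Q, K)

    # Indices and similarities of the k nearest key patches per query patch.
    topk_idx = np.argpartition(-sims, k - 1, axis=1)[:, :k]   # (Q, k)
    topk_sims = np.take_along_axis(sims, topk_idx, axis=1)    # (Q, k)

    # Softmax weights over the retrieved neighbours.
    weights = np.exp(topk_sims / temperature)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted vote over the neighbours' one-hot labels.
    one_hot = np.eye(num_classes)[key_labels[topk_idx]]       # (Q, k, C)
    scores = (weights[..., None] * one_hot).sum(axis=1)       # (Q, C)
    return scores.argmax(axis=1)

# Toy data standing in for encoder features (assumed, not from the paper).
keys = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
queries = np.array([[0.8, 0.2], [0.2, 0.8]])
pred = knn_label_propagation(keys, labels, queries, num_classes=2, k=2)
print(pred)
```

In a setup like this the encoder is never updated, so the quality of the predicted masks reflects only how well its dense features match across viewpoints, which is the property the benchmark aims to isolate.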
Taxonomy
Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis
