Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis

Valentina Lilova; Toyesh Chakravorty; Julian I. Bibo; Emma Boccaletti; Brandon Li; Lívia Baxová; Cees G. M. Snoek; Mohammadreza Salehi

arXiv:2512.11574 · cs.CV · January 19, 2026

TL;DR

This paper introduces a new benchmark for evaluating the intrinsic 3D understanding of foundation models without fine-tuning, extending existing 2D scene understanding frameworks to 3D multi-view scenarios.

Contribution

It presents a novel in-context 3D scene understanding benchmark based on the Hummingbird framework and 3D Multi-View ImageNet, enabling direct assessment of dense visual features.

Findings

01 DINO-based encoders perform well across large viewpoint shifts

02 Benchmarking 7 state-of-the-art models reveals varying 3D understanding capabilities

03 The benchmark provides a fine-grained evaluation across different difficulty levels

Abstract

Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream fine-tuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pre-trained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no fine-tuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images depicting objects at specific camera angles (keys), we benchmark the performance of segmenting novel views (queries) and report the scores in 4 categories of easy, medium, hard, and extreme based on the key-query…
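
To illustrate what a fine-tuning-free probe of dense features can look like, here is a minimal sketch. It assumes the in-context step is a nearest-neighbour label transfer: each query patch takes its label from the most similar key patches. The function names, the toy 2D feature vectors, the values of `k` and `temperature`, and the use of mean IoU as the score are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def knn_label_transfer(query_feats, key_feats, key_labels, k=3, temperature=0.1):
    """Label each query patch by a weighted vote over its k most similar key patches.

    Hypothetical helper: weights are exp(similarity / temperature), so closer
    key patches contribute more to the vote. No parameters are trained.
    """
    predictions = []
    for q in query_feats:
        sims = [(cosine(q, f), label) for f, label in zip(key_feats, key_labels)]
        top_k = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
        votes = defaultdict(float)
        for sim, label in top_k:
            votes[label] += math.exp(sim / temperature)
        predictions.append(max(votes, key=votes.get))
    return predictions


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy data: four labelled key patches (two per class) and two query patches
# standing in for a novel view. Real features would come from a frozen encoder.
key_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
key_labels = [0, 0, 1, 1]
query_feats = [[0.8, 0.2], [0.2, 0.8]]
query_labels = [0, 1]

pred = knn_label_transfer(query_feats, key_feats, key_labels, k=3)
score = mean_iou(pred, query_labels, num_classes=2)
```

Because the encoder is never updated, a score computed this way depends only on how well the frozen features of matching regions align across views, which is the property such a benchmark is meant to isolate.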

Taxonomy

Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis