Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

Yunze Man; Shuhong Zheng; Zhipeng Bao; Martial Hebert; Liang-Yan Gui; Yu-Xiong Wang

arXiv:2409.03757 · cs.CV · May 9, 2025

TL;DR

This paper systematically evaluates seven visual foundation models, spanning image-based, video-based, and 3D encoders, on complex 3D scene understanding, identifying each model's strengths and limitations across tasks and scenarios to guide future model selection.

Contribution

The paper compares seven vision encoders across four scene understanding tasks (Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration), highlighting key performance insights and challenging existing assumptions.

Findings

01. DINOv2 outperforms other models in overall performance

02. Video models excel in object-level scene understanding

03. Diffusion models enhance geometric task performance

Abstract

Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level…

Code & Models

Repositories

yunzeman/lexicon3d (PyTorch)

Taxonomy

Topics: Image Processing and 3D Reconstruction · Advanced Vision and Imaging · 3D Surveying and Cultural Heritage

Methods: Diffusion