Understanding Depth and Height Perception in Large Visual-Language Models

Shehreen Azad; Yash Jain; Rishit Garg; Yogesh S Rawat; Vibhav Vineet

arXiv:2408.11748·cs.CV·April 28, 2025

TL;DR

This paper evaluates the geometric understanding of large Vision Language Models, focusing on their ability to perceive depth and height, revealing significant shortcomings and proposing a benchmark for future improvements.

Contribution

Introduces GeoMeter, a benchmark suite for assessing depth and height perception in VLMs, and benchmarks 18 models to identify their limitations in geometric reasoning.

Findings

01 VLMs excel at shape and size perception

02 Models struggle with depth and height perception

03 Depth and height reasoning is limited in current VLMs

Abstract

Geometric understanding, including depth and height perception, is fundamental to intelligence and crucial for navigating our environment. Despite the impressive capabilities of large Vision Language Models (VLMs), it remains unclear how well they possess the geometric understanding required for practical applications in visual perception. In this work, we focus on evaluating the geometric understanding of these models, specifically targeting their ability to perceive the depth and height of objects in an image. To address this, we introduce GeoMeter, a suite of benchmark datasets, encompassing 2D and 3D scenarios, to rigorously evaluate these aspects. By benchmarking 18 state-of-the-art VLMs, we found that although they excel in perceiving basic geometric properties like shape and size, they consistently struggle with depth and height perception. Our analysis reveals that these…
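
The per-property comparison the abstract describes (shape and size versus depth and height) can be pictured as accuracy computed separately for each question category. The sketch below is only an illustration under stated assumptions: the record format, the category names, the function name `accuracy_by_category`, and the toy answers are all hypothetical, and none of it reflects GeoMeter's actual data format, the authors' evaluation code, or their reported results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} for (category, predicted, expected) tuples.

    The tuple layout is a hypothetical format chosen for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, expected in records:
        total[category] += 1
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if predicted.strip().lower() == expected.strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Made-up toy answers for illustration only; not results from the paper.
toy_records = [
    ("shape", "circle", "circle"),
    ("shape", "Square", "square"),
    ("depth", "red", "blue"),
    ("depth", "blue", "blue"),
    ("height", "left", "right"),
]

scores = accuracy_by_category(toy_records)
for category, score in sorted(scores.items()):
    print(f"{category}: {score:.2f}")
```

Reporting accuracy per category rather than one pooled number is what makes a gap between basic properties and depth or height visible at all; a single aggregate score would hide it.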

Code & Models

Repositories

sacrcv/dh-bench · PyTorch · Official

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: Focus