VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Rui Zhang

TL;DR
This paper introduces VisOnlyQA, a dataset revealing that large vision language models struggle with perceiving basic geometric information in images, highlighting a significant gap in current models' visual understanding capabilities.
Contribution
The paper presents VisOnlyQA, a new dataset for evaluating geometric perception in LVLMs, and demonstrates their deficiencies in perceiving geometric properties across diverse tasks.
Findings
LVLMs perform poorly on geometric perception tasks.
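Fine-tuning on additional training data does not reliably resolve the weakness in geometric perception.
LVLMs built on stronger LLMs perceive geometric information more accurately, suggesting that the language model's processing of visual information is a bottleneck.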
Abstract
Large Vision Language Models (LVLMs) have achieved remarkable performance in various vision-language tasks. However, it is still unclear how accurately LVLMs can perceive visual information in images. In particular, the capability of LVLMs to perceive geometric information, such as shape, angle, and size, remains insufficiently analyzed, although the perception of these properties is crucial for tasks that require a detailed visual understanding. In this work, we introduce VisOnlyQA, a dataset for evaluating the geometric perception of LVLMs, and reveal that LVLMs often cannot accurately perceive basic geometric information in images, while human performance is nearly perfect. VisOnlyQA consists of 12 tasks that directly ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. Our experiments highlight the following findings: (i) State-of-the-art…
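As a rough illustration of how results on a dataset like this could be summarized, the sketch below computes accuracy per image category from a list of model predictions. The record keys (`category`, `answer`, `prediction`) and the exact-match scoring rule are assumptions made for this example; they are not the dataset's actual schema or the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} for an iterable of prediction records.

    Each record is a dict with hypothetical keys 'category', 'answer',
    and 'prediction'. Scoring is case-insensitive exact match, which is
    an assumption of this sketch rather than the paper's protocol.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        gold = record["answer"].strip().lower()
        predicted = record["prediction"].strip().lower()
        if gold == predicted:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Made-up records, only to show the expected input shape.
example_records = [
    {"category": "geometric shapes", "answer": "True", "prediction": "true"},
    {"category": "geometric shapes", "answer": "b", "prediction": "a"},
    {"category": "charts", "answer": "c", "prediction": "c"},
    {"category": "3D shapes", "answer": "3", "prediction": "4"},
]

scores = accuracy_by_category(example_records)
for category, score in scores.items():
    print(f"{category}: {score:.2f}")
```

Reporting accuracy per category rather than as a single number makes it easier to see whether a model fails uniformly or only on particular kinds of images.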
Taxonomy
Topics: Hand Gesture Recognition Systems
Methods: Sparse Evolutionary Training
