VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

Ryo Kamoi; Yusen Zhang; Sarkar Snigdha Sarathi Das; Ranran Haoran Zhang; Rui Zhang

arXiv:2412.00947 · cs.CL · July 15, 2025

TL;DR

This paper introduces VisOnlyQA, a dataset revealing that large vision language models struggle with perceiving basic geometric information in images, highlighting a significant gap in current models' visual understanding capabilities.

Contribution

The paper presents VisOnlyQA, a new dataset for evaluating geometric perception in LVLMs, and demonstrates their deficiencies in perceiving geometric properties across diverse tasks.

Findings

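01. LVLMs perform poorly on geometric perception tasks, while human performance is nearly perfect.

02. Additional training data does not reliably improve geometric perception.

03. LVLMs built on stronger LLMs perceive geometric information better, which suggests the bottleneck lies in how the model processes visual information rather than in reasoning.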

Abstract

Large Vision Language Models (LVLMs) have achieved remarkable performance in various vision-language tasks. However, it is still unclear how accurately LVLMs can perceive visual information in images. In particular, the capability of LVLMs to perceive geometric information, such as shape, angle, and size, remains insufficiently analyzed, although the perception of these properties is crucial for tasks that require a detailed visual understanding. In this work, we introduce VisOnlyQA, a dataset for evaluating the geometric perception of LVLMs, and reveal that LVLMs often cannot accurately perceive basic geometric information in images, while human performance is nearly perfect. VisOnlyQA consists of 12 tasks that directly ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. Our experiments highlight the following findings: (i) State-of-the-art…

Code & Models

Repositories

psunlpgroup/visonlyqa (official)
