A Study of Commonsense Reasoning over Visual Object Properties

Abhishek Kolari; Mohammadhossein Khojasteh; Yifan Jiang; Floris den Hengst; Filip Ilievski

arXiv:2508.10956 · cs.CV · January 16, 2026

TL;DR

This paper introduces a systematic framework and benchmarks for evaluating how vision-language models reason over object properties, revealing significant gaps relative to human performance, especially in complex and counterfactual reasoning tasks.

Contribution

It presents a new evaluation framework and benchmark datasets for assessing VLMs' reasoning abilities over object properties across multiple reasoning levels and image types.

Findings

01 VLMs achieve below 40% counting accuracy

02 VLMs achieve below 70% comparison accuracy

03 Models struggle with photographic images and counterfactual reasoning

Abstract

Inspired by human categorization, object property reasoning involves identifying and recognizing low-level details and higher-level abstractions. While current visual question answering (VQA) studies consider multiple object properties, such as size, they typically blend perception and reasoning and lack representativeness in terms of reasoning and image categories, making it unclear whether and how vision-language models (VLMs) abstract and reason over depicted objects. To this end, we introduce a systematic evaluation framework comprising images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions, informed by prior work on common sense. We develop a procedure to instantiate this framework in two VQA object reasoning benchmarks: OPTICS-CNT, comprising 360 images paired with 1,080 multi-level, count-based questions, and…
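
As a rough sketch of the framework's shape (not the authors' code), the Python below enumerates the grid implied by the abstract: three image types, three reasoning levels, and four property dimensions. All labels are hypothetical placeholders, since the abstract does not name the individual types, levels, or dimensions. The questions-per-image ratio is derived only from the stated OPTICS-CNT counts; that it corresponds to one question per reasoning level is an assumption.

```python
from itertools import product

# Hypothetical placeholder labels: the abstract gives only the counts (3, 3, 4),
# not the names of the image types, reasoning levels, or property dimensions.
IMAGE_TYPES = ["image_type_1", "image_type_2", "image_type_3"]
REASONING_LEVELS = ["level_1", "level_2", "level_3"]
PROPERTY_DIMENSIONS = ["property_1", "property_2", "property_3", "property_4"]

# Every combination of image type, reasoning level, and property dimension.
cells = list(product(IMAGE_TYPES, REASONING_LEVELS, PROPERTY_DIMENSIONS))
n_cells = len(cells)

# Counts stated in the abstract for OPTICS-CNT.
NUM_IMAGES = 360
NUM_QUESTIONS = 1080

# Average number of questions per image. Reading this as one question per
# reasoning level is an assumption, not something the abstract states.
questions_per_image = NUM_QUESTIONS / NUM_IMAGES

print(f"framework cells: {n_cells}")
print(f"questions per image: {questions_per_image}")
```

The grid is only a way to picture the evaluation space. How images and questions are actually distributed across its cells is not specified in the text above.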

Code & Models

Datasets

Abk802/ORBIT
dataset · 12 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques