A Study of Commonsense Reasoning over Visual Object Properties
Abhishek Kolari, Mohammadhossein Khojasteh, Yifan Jiang, Floris den Hengst, Filip Ilievski

TL;DR
This paper introduces a systematic framework and benchmarks for evaluating vision-language models' reasoning over object properties, revealing significant gaps relative to human performance, especially in complex and counterfactual reasoning tasks.
Contribution
It presents a new evaluation framework and benchmark datasets for assessing VLMs' reasoning abilities over object properties across multiple reasoning levels and image types.
Findings
VLMs score below 40% accuracy on counting questions
VLMs score below 70% accuracy on comparison questions
Models struggle with photographic images and counterfactual reasoning
Abstract
Inspired by human categorization, object property reasoning involves identifying and recognizing low-level details and higher-level abstractions. While current visual question answering (VQA) studies consider multiple object properties, such as size, they typically blend perception and reasoning and lack representativeness in terms of reasoning and image categories, making it unclear whether and how vision-language models (VLMs) abstract and reason over depicted objects. To this end, we introduce a systematic evaluation framework comprising images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions, informed by prior work on common sense. We develop a procedure to instantiate this framework in two VQA object reasoning benchmarks: OPTICS-CNT, comprising 360 images paired with 1,080 multi-level, count-based questions, and…
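The figures quoted for OPTICS-CNT (360 images, 1,080 questions) work out to three questions per image, which is consistent with one question per reasoning level, although the truncated abstract does not state that mapping outright. The sketch below is only an illustration of how per-level accuracy could be tallied for such a benchmark. The record fields, the numeric level labels 1 to 3, and the toy answers are all hypothetical and are not taken from the paper or its released code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionResult:
    """One answered benchmark question (all field names are hypothetical)."""
    image_id: str
    reasoning_level: int  # placeholder labels 1..3 for the three levels
    correct: bool


def accuracy_by_level(results):
    """Return {reasoning_level: fraction of correctly answered questions}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r.reasoning_level] += 1
        hits[r.reasoning_level] += int(r.correct)
    return {level: hits[level] / totals[level] for level in totals}


# Figures stated in the abstract for OPTICS-CNT.
NUM_IMAGES = 360
NUM_QUESTIONS = 1080
QUESTIONS_PER_IMAGE = NUM_QUESTIONS // NUM_IMAGES

# Toy answers for two images; these are NOT results from the paper.
toy = [
    QuestionResult("img-0", 1, True),
    QuestionResult("img-0", 2, False),
    QuestionResult("img-0", 3, False),
    QuestionResult("img-1", 1, True),
    QuestionResult("img-1", 2, True),
    QuestionResult("img-1", 3, False),
]
scores = accuracy_by_level(toy)
```

Grouping by reasoning level rather than reporting one pooled number mirrors the framing in the abstract, where the levels are described as increasing in complexity; the same grouping could be keyed on image type or property dimension instead.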
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
