STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models
Mahiro Ukai, Shuhei Kurita, Nakamasa Inoue

TL;DR
This paper introduces STATUS Bench, a comprehensive benchmark and dataset for evaluating and improving vision-language models' ability to recognize and understand subtle object states and their changes across diverse scenarios.
Contribution
It presents the first rigorous benchmark and large-scale dataset for object state understanding in vision-language models, along with an evaluation scheme for multiple related tasks.
Findings
Most open-weight VLMs perform at chance level in zero-shot evaluation under the benchmark's rigorous scheme.
Fine-tuning on the new dataset improves model performance significantly.
STATUS Bench reveals the need for specialized training to understand object states.
Abstract
Object state recognition aims to identify the specific condition of objects, such as their positional states (e.g., open or closed) and functional states (e.g., on or off). While recent Vision-Language Models (VLMs) are capable of performing a variety of multimodal tasks, it remains unclear how precisely they can identify object states. To address this question, we introduce the STAte and Transition UnderStanding Benchmark (STATUS Bench), the first benchmark for rigorously evaluating the ability of VLMs to understand subtle variations in object states in diverse situations. Specifically, STATUS Bench introduces a novel evaluation scheme that requires VLMs to perform three tasks simultaneously: object state identification (OSI), image retrieval (IR), and state change identification (SCI). These tasks are defined over our fully hand-crafted dataset involving image pairs, their…
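To make the idea of evaluating three tasks simultaneously concrete, the following is a minimal sketch of one way a joint score could be computed. It rests on an assumption that is not confirmed by the text above: that an image pair counts as solved only when OSI, IR, and SCI are all answered correctly for it. The record type `PairResult`, the functions `per_task_accuracy` and `joint_accuracy`, and the sample data are hypothetical and are not taken from the benchmark's code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairResult:
    """Hypothetical per-image-pair record of whether each task was answered correctly."""
    osi_correct: bool  # object state identification
    ir_correct: bool   # image retrieval
    sci_correct: bool  # state change identification


def per_task_accuracy(results: List[PairResult]) -> Dict[str, float]:
    """Accuracy of each task scored independently of the others."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {
        "OSI": sum(r.osi_correct for r in results) / n,
        "IR": sum(r.ir_correct for r in results) / n,
        "SCI": sum(r.sci_correct for r in results) / n,
    }


def joint_accuracy(results: List[PairResult]) -> float:
    """Assumed joint metric: a pair counts only if all three tasks are correct."""
    if not results:
        raise ValueError("results must not be empty")
    solved = sum(r.osi_correct and r.ir_correct and r.sci_correct for r in results)
    return solved / len(results)


# Made-up outcomes for four image pairs, for illustration only.
results = [
    PairResult(True, True, True),
    PairResult(True, False, True),
    PairResult(True, True, False),
    PairResult(False, True, True),
]

print(per_task_accuracy(results))
print(joint_accuracy(results))
```

In this made-up example each task looks strong when scored alone, yet only one of the four pairs is solved jointly. Under the stated assumption, this is how a joint scheme can expose inconsistent answers that per-task accuracy hides.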
