Do Pre-trained Vision-Language Models Encode Object States?

Kaleb Newman; Shijie Wang; Yuan Zang; David Heffren; Chen Sun

arXiv:2409.10488·cs.CV·September 17, 2024

TL;DR

This paper investigates whether pre-trained vision-language models (VLMs) encode object states. It finds that they recognize objects reliably but struggle to distinguish an object's physical states, and it identifies areas for improvement.

Contribution

The study curates ChangeIt-Frames, a dataset for object state recognition, and evaluates nine open-source VLMs trained with contrastive and generative objectives. It shows where these models fall short in encoding object states and identifies three areas for improvement.

Findings

01 VLMs reliably perform object recognition

02 VLMs fail to accurately distinguish objects' physical states

03 Experiments identify three areas for improving object state encoding

Abstract

For a vision-language model (VLM) to understand the physical world, such as cause and effect, a first step is to capture the temporal dynamics of the visual world, for example how the physical states of objects evolve over time (e.g., a whole apple into a sliced apple). Our paper aims to investigate whether VLMs pre-trained on web-scale data learn to encode object states, which can be extracted with zero-shot text prompts. We curate an object state recognition dataset, ChangeIt-Frames, and evaluate nine open-source VLMs, including models trained with contrastive and generative objectives. We observe that while these state-of-the-art vision-language models can reliably perform object recognition, they consistently fail to accurately distinguish the objects' physical states. Through extensive experiments, we identify three areas of improvement for VLMs to better encode object states, namely…
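To make "extracted with zero-shot text prompts" concrete, here is a minimal sketch of how a contrastive VLM is typically queried for an object state: the image embedding is compared against the embeddings of one text prompt per candidate state, and the most similar prompt wins. Everything specific here is an assumption for illustration and is not taken from the paper: the prompt wording, the `temperature` value, the function names, and the toy three-dimensional vectors, which stand in for the outputs of a real model's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, prompt_embs, temperature=0.01):
    """Pick the prompt whose embedding is most similar to the image embedding.

    prompt_embs maps each text prompt to its embedding. The temperature is an
    illustrative value, not one reported in the paper.
    """
    sims = {prompt: cosine(image_emb, emb) for prompt, emb in prompt_embs.items()}
    # Numerically stable softmax over the temperature-scaled similarities.
    top = max(sims.values())
    exps = {p: math.exp((s - top) / temperature) for p, s in sims.items()}
    total = sum(exps.values())
    probs = {p: x / total for p, x in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings (hypothetical): in practice these would come from a
# pre-trained VLM's text encoder and image encoder.
prompt_embs = {
    "a photo of a whole apple": [0.9, 0.1, 0.2],
    "a photo of a sliced apple": [0.2, 0.9, 0.1],
}
image_emb = [0.3, 0.8, 0.1]

best, probs = zero_shot_classify(image_emb, prompt_embs)
print(best)
```

With real encoders, the same comparison can be run once with object prompts and once with state prompts for the same object, which is the contrast the abstract draws between object recognition and object state recognition.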

Code & Models

Repositories

brown-palm/object-states (official)

Taxonomy

Topics: Multimodal Machine Learning Applications