ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense

Kankan Zhou; Eason Lai; Wei Bin Au Yeong; Kyriakos Mouratidis; Jing Jiang

arXiv:2310.19301 · cs.CL · October 31, 2023 · 1 citation

TL;DR

This paper introduces ROME, a dataset designed to evaluate whether pre-trained vision-language models can reason beyond common sense, revealing that most models struggle with counter-intuitive scenarios involving visual content.

Contribution

The paper presents ROME, a novel dataset for probing reasoning beyond common sense in vision-language models, highlighting current limitations of state-of-the-art models in understanding counter-intuitive images.

Findings

01. Most models fail on counter-intuitive scenarios

02. Models tend to rely on common-sense assumptions

03. ROME reveals significant reasoning gaps in current models

Abstract

Humans possess a strong capability for reasoning beyond common sense. For example, given an unconventional image of a goldfish lying on the table next to an empty fishbowl, a human would effortlessly determine that the fish is not inside the fishbowl. The case, however, may be different for a vision-language model, whose reasoning could gravitate towards the common scenario that the fish is inside the bowl, despite the visual input. In this paper, we introduce a novel probing dataset named ROME (reasoning beyond commonsense knowledge) to evaluate whether state-of-the-art pre-trained vision-language models have the reasoning capability to correctly interpret counter-intuitive content. ROME contains images that defy commonsense knowledge with regard to color, shape, material, size, and positional relation. Experiments on the state-of-the-art pre-trained vision-language models reveal…
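The probing idea in the abstract can be illustrated with a minimal sketch. The page does not describe ROME's actual file format or prompts, so everything here is an assumption made for illustration: the `Probe` record, the yes/no question style, the image identifiers, the color example, and the `commonsense_prior` baseline are all hypothetical. The sketch only shows how per-category accuracy might expose a model that answers from prior expectations rather than from the image.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# The five attribute categories named in the abstract.
CATEGORIES = ("color", "shape", "material", "size", "positional relation")


@dataclass(frozen=True)
class Probe:
    """One yes/no question about a counter-intuitive image (hypothetical format)."""
    image_id: str   # hypothetical identifier, not a real ROME file name
    category: str   # one of CATEGORIES
    question: str
    answer: str     # gold answer: "yes" or "no"


def accuracy_by_category(
    probes: Iterable[Probe], predict: Callable[[Probe], str]
) -> Dict[str, float]:
    """Return the fraction of correctly answered probes for each category."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for probe in probes:
        total[probe.category] += 1
        if predict(probe).strip().lower() == probe.answer:
            correct[probe.category] += 1
    return {category: correct[category] / total[category] for category in total}


def commonsense_prior(probe: Probe) -> str:
    """Toy stand-in for a model: ignores the image and always answers "yes"."""
    return "yes"


# Illustrative probes. The fishbowl case follows the abstract's example;
# the color case is invented for this sketch.
probes = [
    Probe("img-001", "positional relation", "Is the fish inside the fishbowl?", "no"),
    Probe("img-001", "positional relation", "Is the fish outside the fishbowl?", "yes"),
    Probe("img-002", "color", "Is the banana yellow?", "no"),
]

scores = accuracy_by_category(probes, commonsense_prior)
for category, score in scores.items():
    print(f"{category}: {score:.2f}")
```

Replacing `commonsense_prior` with a call to a real vision-language model would give the kind of per-category comparison the abstract alludes to, under whatever prompt format the actual dataset uses.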

Code & Models

Repositories

k-square-00/rome (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning