VisMin: Visual Minimal-Change Understanding

Rabiul Awal; Saba Ahmadi; Le Zhang; Aishwarya Agrawal

arXiv:2407.16772 · cs.CV · January 23, 2025

TL;DR

VisMin introduces a challenging benchmark for visual-language models that tests their ability to understand minimal differences in images and captions, revealing current limitations and enabling targeted improvements.

Contribution

The paper presents a new benchmark, VisMin, for fine-grained understanding, along with an automatic framework for dataset creation and a large-scale training dataset for model finetuning.

Findings

01 Current VLMs struggle to understand spatial relations and counts.

02 Finetuning on the generated dataset improves model performance.

03 Resources and benchmarks are publicly released for future research.

Abstract

Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: object, attribute, count, and spatial relation. These changes test the models' understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a…
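
The matching task described in the abstract (two images, two captions, pick the correct pairing) can be illustrated with a small scoring sketch. This is an assumption for illustration, not the paper's stated metric: it uses a Winoground-style criterion over a hypothetical 2x2 similarity matrix `sim`, where `sim[i][j]` is a model's score for image `i` with caption `j`, and the correct pairs are (0, 0) and (1, 1). The function name `match_scores` is also hypothetical.

```python
from typing import Dict, Sequence


def match_scores(sim: Sequence[Sequence[float]]) -> Dict[str, bool]:
    """Check whether a 2x2 image-caption similarity matrix yields the correct match.

    sim[i][j] is the similarity of image i with caption j (hypothetical input);
    the ground-truth pairs are assumed to be (image 0, caption 0) and
    (image 1, caption 1).
    """
    # Caption selection: for each image, its own caption must score higher.
    text_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Image selection: for each caption, its own image must score higher.
    image_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Group: both directions must be correct for the example to count.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


if __name__ == "__main__":
    # Illustrative similarities only, not results from the paper.
    print(match_scores([[0.9, 0.1], [0.2, 0.8]]))
    print(match_scores([[0.5, 0.1], [0.6, 0.9]]))
```

Because the two images differ in only one aspect, a model can pick the right caption for each image while still ranking the wrong image higher for a caption, which is why the sketch checks both directions separately.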

Code & Models

Datasets

mair-lab/earl-datasets
dataset · 19 dl

Videos

VisMin: Visual Minimal-Change Understanding · SlidesLive

Taxonomy

Methods: Contrastive Language-Image Pre-training · Focus · Diffusion