ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Irene Huang; Wei Lin; M. Jehanzeb Mirza; Jacob A. Hansen; Sivan Doveh; Victor Ion Butoi; Roei Herzig; Assaf Arbelle; Hilde Kuehne; Trevor Darrell; Chuang Gan; Aude Oliva; Rogerio Feris; Leonid Karlinsky

arXiv:2406.08164 · cs.CV · November 14, 2024 · 1 citation

TL;DR

ConMe introduces a new compositional reasoning benchmark for modern vision-language models, using a VLM-based data generation pipeline that creates challenging questions to better evaluate their reasoning capabilities.

Contribution

The paper presents ConMe, a novel compositional reasoning (CR) benchmark and data generation pipeline that uses VLMs to produce hard reasoning questions, addressing the limitations of previous benchmarks.

Findings

1. ConMe causes a decrease of up to 33% in the CR performance of VLMs compared to preceding benchmarks, giving a more realistic picture of their CR capabilities.

2. The pipeline autonomously generates and selects challenging CR questions, which are subsequently validated manually.

3. ConMe exposes weaknesses in state-of-the-art VLMs, reinstating the CR challenge.

Abstract

Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe, a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce "hard CR Q&A". Through a new…

Code & Models

Repositories

jmiemirza/conme
PyTorch · Official

Models

conme/ConMe
model · ♡ 2

Taxonomy

Topics: Elevator Systems and Control · Software System Performance and Reliability · Advanced Software Engineering Methodologies