ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
Irene Huang, Wei Lin, M. Jehanzeb Mirza, Jacob A. Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuehne, Trevor Darrell, Chuang Gan, Aude Oliva, Rogerio Feris, Leonid Karlinsky

TL;DR
ConMe is a new compositional reasoning benchmark for modern vision-language models (VLMs). It is built with a VLM-based data generation pipeline that creates challenging questions, so that the models' reasoning capabilities can be evaluated more accurately.
Contribution
The paper presents ConMe, a novel compositional reasoning (CR) benchmark and data generation pipeline that uses VLMs to produce hard reasoning questions, addressing limitations of previous benchmarks.
Findings
On ConMe, the CR performance of VLMs drops by up to 33% compared with preceding benchmarks, giving a more realistic picture of their CR capabilities.
The pipeline autonomously generates and selects challenging CR questions, which are then validated manually.
ConMe effectively exposes weaknesses in state-of-the-art VLMs, reinstating the CR challenge.
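The "generate and select" step in the findings above can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Candidate` type, the `select_hard` function, the `max_correct_fraction` threshold, and the stub answerers are all hypothetical names chosen for illustration. The sketch assumes that a candidate multiple-choice question counts as "hard" when at most a given fraction of the evaluated models answer it correctly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """A hypothetical multiple-choice CR question about one image."""
    image_id: str
    question: str
    options: Tuple[str, ...]
    answer_index: int  # index of the correct option


# An answerer maps a candidate to the index of the option it picks.
# In a real setup this would wrap a VLM call; here it is a plain function.
Answerer = Callable[[Candidate], int]


def select_hard(
    candidates: List[Candidate],
    answerers: Dict[str, Answerer],
    max_correct_fraction: float = 0.5,
) -> List[Candidate]:
    """Keep candidates that at most `max_correct_fraction` of answerers get right.

    The threshold is an assumption of this sketch, not a value from the paper.
    """
    if not answerers:
        raise ValueError("at least one answerer is required")
    kept = []
    for cand in candidates:
        n_correct = sum(
            1 for answer in answerers.values() if answer(cand) == cand.answer_index
        )
        if n_correct / len(answerers) <= max_correct_fraction:
            kept.append(cand)
    return kept


# Stub answerers standing in for real models (assumptions for the demo).
def always_first(cand: Candidate) -> int:
    return 0


def oracle(cand: Candidate) -> int:
    return cand.answer_index


demo_candidates = [
    Candidate("img1", "What color is the cup?", ("red", "blue"), 0),
    Candidate("img2", "Is the dog left of the cat?", ("yes", "no"), 1),
]
demo_hard = select_hard(
    demo_candidates, {"first": always_first, "oracle": oracle}
)
```

In the demo, both stubs answer the first question correctly, so it is dropped, while only the oracle answers the second one correctly, so it is kept. Manual validation, which the findings also mention, would follow this automatic filter and is not modeled here.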
Abstract
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce 'hard CR Q&A'. Through a new…
Taxonomy
Topics: Elevator Systems and Control · Software System Performance and Reliability · Advanced Software Engineering Methodologies
