Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models
Cristian Sbrolli, Matteo Matteucci, Toshihiko Yamasaki

TL;DR
Auto-Comp introduces an automated pipeline for creating scalable, controllable benchmarks to evaluate and analyze the compositional reasoning abilities of vision-language models, revealing universal flaws and complex interactions between context and attribute binding.
Contribution
We develop Auto-Comp, a fully automated synthetic pipeline that generates controlled benchmarks for dissecting visio-linguistic reasoning in VLMs, enabling detailed analysis of their compositional failures.
Findings
VLMs exhibit universal compositional failures in color and spatial reasoning.
Models are highly susceptible to low-entropy distractors, revealing deeper flaws.
Global scene context can both aid and hinder local attribute binding.
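As a rough illustration of how such failures are typically measured for contrastive models, the sketch below counts an example as correct when the image embedding is closer (by cosine similarity) to the true caption than to its hard negative. This is an assumed, generic scoring rule, not the paper's exact protocol; `cosine`, `binding_accuracy`, and the toy embeddings are hypothetical names and values introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def binding_accuracy(examples):
    """Fraction of examples where the correct caption outscores the hard negative.

    `examples` is an iterable of (image_emb, correct_text_emb, negative_text_emb)
    triples. How the embeddings are produced (e.g. by a CLIP-style encoder) is
    outside this sketch and is an assumption.
    """
    examples = list(examples)
    hits = sum(cosine(img, pos) > cosine(img, neg) for img, pos, neg in examples)
    return hits / len(examples)


# Toy embeddings (made up): the first example is scored correctly, the second is not.
toy = [
    ([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]),
    ([0.0, 1.0], [1.0, 0.0], [0.0, 1.0]),
]
accuracy = binding_accuracy(toy)
```

With two candidate captions per image, chance level under this rule is 0.5, so scores near that value would indicate that the model is not using the binding information at all.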
Abstract
Modern Vision-Language Models (VLMs) exhibit a critical flaw in compositional reasoning, often confusing "a red cube and a blue sphere" with "a blue cube and a red sphere". Disentangling the visual and linguistic roots of these failures is a fundamental challenge for robust evaluation. To enable fine-grained, controllable analysis, we introduce Auto-Comp, a fully automated and synthetic pipeline for generating scalable benchmarks. Its controllable nature is key to dissecting and isolating different reasoning skills. Auto-Comp generates paired images from Minimal (e.g., "a monitor to the left of a bicycle on a white background") and LLM-generated Contextual captions (e.g., "In a brightly lit photography studio, a monitor is positioned to the left of a bicycle"), allowing a controlled A/B test to disentangle core binding ability from visio-linguistic complexity. Our evaluation of 20 VLMs…
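The swap-style confusions described in the abstract can be sketched as caption pairs built from templates. The following is a minimal illustration, assuming simple string templates modeled on the abstract's own examples; the function names, the fixed article "a", and the object-swap construction of the spatial negative are assumptions, not the pipeline's actual implementation.

```python
def color_binding_pair(obj_a, color_a, obj_b, color_b):
    """Return (correct, swapped) captions that differ only in color binding."""
    correct = f"a {color_a} {obj_a} and a {color_b} {obj_b}"
    swapped = f"a {color_b} {obj_a} and a {color_a} {obj_b}"
    return correct, swapped


def spatial_pair(obj_a, obj_b, relation="to the left of",
                 suffix=" on a white background"):
    """Return (correct, swapped) Minimal-style captions for a spatial relation.

    The hard negative here swaps the two objects around the relation; this is
    one plausible construction, chosen for illustration.
    """
    correct = f"a {obj_a} {relation} a {obj_b}{suffix}"
    swapped = f"a {obj_b} {relation} a {obj_a}{suffix}"
    return correct, swapped


color_pos, color_neg = color_binding_pair("cube", "red", "sphere", "blue")
spatial_pos, spatial_neg = spatial_pair("monitor", "bicycle")
```

Both captions in a pair contain the same words, so a model that treats text as a bag of words cannot tell them apart; only correct binding of attributes or positions to objects separates them.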
Peer Reviews
Decision·Submitted to ICLR 2026
S1: This paper is well-written and easy to understand.
S2: The investigated problem of benchmarking the compositional understanding of VLMs is important and interesting.
S3: The proposed pipeline for benchmark construction is well designed.
S4: The results and findings are interesting. In particular, models are highly susceptible to low-entropy distractors, showing their compositional failures extend beyond known bag-of-words limitations.
W1: The benchmark generation relies on the capabilities of Gemma3-12b, StableDiffusion3.5-large, and GroundedSAM2. I am curious whether using other models could achieve similar (or even better) benchmark quality. In other words, does the automatic benchmark generation pipeline specifically work for this combination of models, or is it generalizable to stronger ones to be developed in the future?
W2: A related concern is that the capabilities of each model in doing the corresponding tasks shoul
The automated, concept-driven pipeline is well-structured. Auto-Comp can generate vast, high-quality benchmarks without manual labeling. The open-source data and code ensure reproducibility and community impact. The paper evaluates a wide range of models, systematically analyzing error types, context effects, and model hierarchies.
The benchmark currently focuses only on color binding and spatial relations. While sufficient for proof-of-concept, generalization to other compositional phenomena, such as actions and attributes, remains untested. Could your benchmark pipeline incorporate more aspects? Since Auto-Comp uses pretrained T2I models and LLM validators, biases in those systems propagate into the benchmark. Could you provide some insights or discussions on how to minimize the impact of external models on the benchmar
- Evaluating models on images containing the same objects but with different backgrounds (achieved by the "Minimal" and "Contextual" conditions) is novel and leads to a fairly surprising result in the form of models improving in performance when a realistic background is used for hard negatives featuring spatial relations.
- The automated pipeline makes clever use of various open-source resources for both the generation and the filtering components and achieves strong agreement with human judgem
I would argue the paper is affected by two key limitations:
- Firstly, I was surprised to see that the evaluation is restricted to VLMs of the CLIP and SigLIP families. In the context of contrastive vision-language models, I would have expected, for instance, to see NegCLIP, which is finetuned on hard negatives. More importantly, however, the landscape of vision-language models today is not restricted to contrastive vision-language embedding models, but features numerous models, both open and pr
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
