A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets

Zexi Jia; Chuanwei Huang; Hongyan Fei; Yeshuang Zhu; Zhiqiang Yuan; Ying Deng; Jiapei Zhang; Jinchao Zhang; Jie Zhou

arXiv:2507.04699·cs.CV·July 8, 2025

TL;DR

This paper introduces a novel diffusion-based method to automatically generate high-quality counterfactual datasets for improving compositional reasoning in vision-language models, achieving state-of-the-art results with less data.

Contribution

The paper presents a block-based diffusion approach that uses large language models to create diverse counterfactual image-text pairs without manual annotation.
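As the abstract describes, the blocks are arranged according to the spatial relationships between entities. The following is a minimal sketch of that arrangement step only, not the authors' implementation: the relation vocabulary, the `Block` type, the caption template, and the half-canvas placement rule are all hypothetical, and the diffusion step that would render each block is omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    entity: str   # what a generator would be asked to draw in this block
    box: tuple    # (x0, y0, x1, y1) in pixels on a square canvas

def layout(subject, relation, obj, size=512):
    """Return a caption and the two block boxes for one spatial relation.

    Assumption: each entity occupies one half of the canvas. The paper's
    actual compositional rules are not given in this summary.
    """
    half = size // 2
    left, right = (0, 0, half, size), (half, 0, size, size)
    top, bottom = (0, 0, size, half), (0, half, size, size)
    slots = {
        "left of": (left, right),
        "right of": (right, left),
        "above": (top, bottom),
        "below": (bottom, top),
    }
    subject_box, object_box = slots[relation]
    caption = f"a {subject} {relation} a {obj}"
    return caption, [Block(subject, subject_box), Block(obj, object_box)]

def counterfactual_set(subject, obj, size=512):
    """Same two entities under every relation: one controlled variation each."""
    relations = ("left of", "right of", "above", "below")
    return [layout(subject, rel, obj, size) for rel in relations]
```

Calling `counterfactual_set("cat", "dog")` yields four caption/layout pairs that differ only in the relation, which is the kind of precisely controlled variation the method targets.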

Findings

01. Significant improvement in visual reasoning performance after fine-tuning with generated datasets.

02. Achieves state-of-the-art results on multiple benchmarks.

03. Reduces training data requirements compared to existing methods.

Abstract

Vision-language models (VLMs) often struggle with compositional reasoning due to insufficient high-quality image-text data. To tackle this challenge, we propose a novel block-based diffusion approach that automatically generates counterfactual datasets without manual annotation. Our method utilizes large language models to identify entities and their spatial relationships. It then independently generates image blocks as "puzzle pieces" coherently arranged according to specified compositional rules. This process creates diverse, high-fidelity counterfactual image-text pairs with precisely controlled variations. In addition, we introduce a specialized loss function that differentiates inter-set from intra-set samples, enhancing training efficiency and reducing the need for negative samples. Experiments demonstrate that fine-tuning VLMs with our counterfactual datasets significantly…
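The abstract does not give the form of the loss that separates intra-set from inter-set samples. One plausible reading, sketched here purely as an assumption, is a CLIP-style image-to-text contrastive loss in which negatives from the same counterfactual set are weighted more heavily than negatives from other sets. The function name, `tau`, and `intra_weight` are hypothetical and not taken from the paper.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def set_aware_contrastive_loss(img_emb, txt_emb, set_ids, tau=0.07, intra_weight=2.0):
    """Image-to-text InfoNCE with up-weighted same-set negatives.

    img_emb[i] and txt_emb[i] form a matched pair; set_ids[i] names the
    counterfactual set the pair belongs to. With intra_weight=1.0 this
    reduces to the standard one-directional InfoNCE loss.
    """
    imgs = [_normalize(v) for v in img_emb]
    txts = [_normalize(v) for v in txt_emb]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        logits = []
        for j in range(n):
            sim = sum(a * b for a, b in zip(imgs[i], txts[j])) / tau
            # Adding log(w) to a logit multiplies its softmax term by w.
            if j != i and set_ids[j] == set_ids[i]:
                sim += math.log(intra_weight)
            logits.append(sim)
        # Numerically stable log-sum-exp over the row.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(t - m) for t in logits))
        total += log_denom - logits[i]
    return total / n
```

Under this weighting, hard negatives that differ from the positive by a single controlled change contribute more to the gradient than unrelated samples, which is one way a loss could reduce the number of negative samples needed.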
