What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

Hyejin Go; Semi Lee; Hyesong Choi

arXiv:2605.22651·cs.CV·May 22, 2026

TL;DR

This paper introduces Counterfactual Phrase Intervention (CPI), a phrase-level filtering method for vision-language pretraining that enhances compositional generalization by focusing on phrase sensitivity rather than pair-level alignment.

Contribution

CPI provides a novel phrase-level curation framework that improves data selection for vision-language models, outperforming traditional pair-level filtering methods.

Findings

01 CPI improves VL-CheckList-VG Relation scores by +1.91 over the baseline.

02 A CPI-selected 50% subset of the data improves model performance.

03 Applying CPI to NegCLIP further boosts relation scores by +3.84.

Abstract

CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural: a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch…
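
To make the idea of nonce-token substitution concrete, here is a minimal Python sketch under stated assumptions; it is not the authors' implementation. The nonce string `blicket`, the function names `phrase_sensitivity` and `toy_score`, and the word-counting scorer are all hypothetical stand-ins. In practice the scorer would presumably be an image-text similarity from a CLIP-style model. Because the abstract is cut off above, the selection rule built on top of these scores is not sketched.

```python
from typing import Callable, List, Sequence, Set

# Hypothetical nonce token; the source does not say which tokens CPI uses.
NONCE = "blicket"


def phrase_sensitivity(
    caption: str,
    phrases: Sequence[str],
    score: Callable[[str], float],
    nonce: str = NONCE,
) -> List[float]:
    """For each phrase, return the drop in the image-conditioned score
    when that phrase is replaced by a nonce token."""
    base = score(caption)
    drops = []
    for phrase in phrases:
        # Counterfactual caption: swap the first occurrence of the phrase.
        counterfactual = caption.replace(phrase, nonce, 1)
        drops.append(base - score(counterfactual))
    return drops


def toy_score(visible: Set[str]) -> Callable[[str], float]:
    """Assumed stand-in for an image-text scorer: counts caption words
    that are 'visible' in a fixed, imaginary image."""
    def _score(text: str) -> float:
        return float(sum(word in visible for word in text.lower().split()))
    return _score


# Imaginary image containing a red ball and a table (nothing shiny).
scorer = toy_score({"red", "ball", "table"})
drops = phrase_sensitivity(
    "a red ball under a shiny table",
    ["red ball", "under", "shiny table"],
    scorer,
)
print(drops)  # [2.0, 0.0, 1.0]
```

In this toy setup, a phrase whose removal barely changes the score ("under") contributes little image-grounded supervision, while a phrase with a large drop ("red ball") materially supports the match. A pair-level score alone would not separate these two cases.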
