Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
Eun Woo Im, Dhruv Madhwal, Vivek Gupta

TL;DR
This paper introduces Slipform, a framework that improves the compositional reasoning of vision-language models by using lexical concreteness to guide hard negative mining and by training with a new loss function, Cement loss. It reports state-of-the-art accuracy on compositional benchmarks.
Contribution
It proposes ConcretePlant, which systematically isolates and manipulates perceptually grounded concepts, and Cement loss, which mitigates gradient imbalance during contrastive training. Together they improve contrastive learning for compositional understanding.
Findings
Slipform achieves state-of-the-art accuracy on compositional benchmarks.
Modifying concrete terms yields stronger learning signals.
Cement loss mitigates gradient imbalance in contrastive training.
Abstract
Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded…
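To illustrate the idea of concreteness-guided negative mining described in the abstract, the sketch below picks the most concrete word in a caption and swaps it for a same-category alternative to form a hard negative. It is a toy illustration only, not the paper's ConcretePlant pipeline: the concreteness scores, the swap table, and the function names are all hypothetical placeholders, and a real system would presumably draw ratings from a human-annotated concreteness lexicon.

```python
# Toy concreteness ratings on a 1-5 scale (hypothetical values for illustration).
CONCRETENESS = {
    "dog": 4.9, "ball": 4.8, "red": 4.2, "chases": 3.1,
    "happy": 2.5, "a": 1.5, "the": 1.4,
}

# Hypothetical same-category replacements used to build a hard negative.
SWAPS = {"dog": "cat", "ball": "frisbee", "red": "blue", "chases": "ignores"}


def most_concrete_index(tokens):
    """Return the index of the swappable token with the highest concreteness, or None."""
    best_index, best_score = None, float("-inf")
    for i, token in enumerate(tokens):
        word = token.lower()
        score = CONCRETENESS.get(word)
        # Skip words with no rating or no known replacement.
        if score is None or word not in SWAPS:
            continue
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def make_hard_negative(caption):
    """Swap the most concrete word of the caption; return None if nothing can be swapped."""
    tokens = caption.split()
    index = most_concrete_index(tokens)
    if index is None:
        return None
    negative = list(tokens)
    negative[index] = SWAPS[tokens[index].lower()]
    return " ".join(negative)


if __name__ == "__main__":
    print(make_hard_negative("a dog chases the red ball"))
```

In this sketch the swap targets "dog" rather than "chases" because it carries the highest toy rating, which mirrors the abstract's claim that edits to highly concrete terms produce larger visual discrepancies between the caption and its negative.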
