Natural Language Inference Improves Compositionality in Vision-Language Models
Paola Cascante-Bonilla, Yu Hou, Yang Trista Cao, Hal Daumé III, Rachel Rudinger

TL;DR
This paper introduces CECE, a novel NLI-based method that enhances compositional reasoning in vision-language models by generating diverse, semantically consistent sentences, leading to state-of-the-art performance without extra fine-tuning.
Contribution
The paper presents CECE, a new approach leveraging Natural Language Inference to improve compositional reasoning and interpretability in vision-language models, outperforming previous methods.
Findings
Improves the Winoground group score by +19.2% over the best prior work
Improves the EqBen group score by +12.9% over the best prior work
Enhances interpretability and reduces reliance on biases
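For readers unfamiliar with the metric, the group score reported above can be sketched as follows. This is a minimal illustration of the standard Winoground-style definition, not code from the paper: the function names `winoground_scores` and `group_score` are hypothetical, and the similarity values are made-up numbers. Each example pairs two captions with two images, and `s[i][j]` is assumed to be a model's alignment score for caption i with image j.

```python
from typing import Dict, Sequence

Scores = Sequence[Sequence[float]]  # 2x2 matrix: s[caption][image]


def winoground_scores(s: Scores) -> Dict[str, bool]:
    """Per-example correctness for one caption/image pair of pairs.

    Hypothetical helper for illustration; s[i][j] scores caption i vs image j.
    """
    # Text score: for each image, the matching caption must score higher.
    text_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # Image score: for each caption, the matching image must score higher.
    image_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    # Group score: both the text and image conditions must hold.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


def group_score(examples: Sequence[Scores]) -> float:
    """Fraction of examples that satisfy the group condition."""
    return sum(winoground_scores(s)["group"] for s in examples) / len(examples)


# Made-up similarity matrices, not results from the paper.
a = [[0.9, 0.1], [0.2, 0.8]]    # passes text and image checks
b = [[0.9, 0.5], [0.95, 0.8]]   # fails both checks
c = [[0.6, 0.5], [0.7, 0.8]]    # passes image check only
print(group_score([a, b, c]))
```

Because the group condition requires every pairwise comparison to be correct at once, it is the strictest of the three scores, which is why gains on it are usually the headline number.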
Abstract
Compositional reasoning in Vision-Language Models (VLMs) remains challenging as these models often struggle to relate objects, attributes, and spatial relationships. Recent methods aim to address these limitations by relying on the semantics of the textual description, using Large Language Models (LLMs) to break it down into subsets of questions and answers. However, these methods primarily operate on the surface level, failing to incorporate deeper lexical understanding while introducing incorrect assumptions generated by the LLM. In response to these issues, we present Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages Natural Language Inference (NLI) to generate entailments and contradictions from a given premise. CECE produces lexically diverse sentences while maintaining their core meaning. Through extensive experiments, we show…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Semantic Web and Ontologies
