Natural Language Inference Improves Compositionality in Vision-Language Models

Paola Cascante-Bonilla; Yu Hou; Yang Trista Cao; Hal Daumé III; Rachel Rudinger

arXiv:2410.22315 · cs.CL · October 30, 2024

TL;DR

This paper introduces CECE (Caption Expansion with Contradictions and Entailments), a method based on Natural Language Inference (NLI) that improves compositional reasoning in vision-language models by generating lexically diverse sentences that preserve a caption's core meaning, reaching state-of-the-art results without additional fine-tuning.

Contribution

CECE uses Natural Language Inference to generate entailments and contradictions from a caption, improving compositional reasoning and interpretability in vision-language models and outperforming previous methods on Winoground and EqBen.

Findings

01. Achieves +19.2% on the Winoground group score over prior work

02. Achieves +12.9% on the EqBen group score over prior work

03. Enhances interpretability and reduces reliance on biased or superficial features

Abstract

Compositional reasoning in Vision-Language Models (VLMs) remains challenging as these models often struggle to relate objects, attributes, and spatial relationships. Recent methods aim to address these limitations by relying on the semantics of the textual description, using Large Language Models (LLMs) to break them down into subsets of questions and answers. However, these methods primarily operate on the surface level, failing to incorporate deeper lexical understanding while introducing incorrect assumptions generated by the LLM. In response to these issues, we present Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages Natural Language Inference (NLI) to generate entailments and contradictions from a given premise. CECE produces lexically diverse sentences while maintaining their core meaning. Through extensive experiments, we show…
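To make the idea of caption expansion concrete, here is a minimal sketch of how scores for a caption, its entailments, and its contradictions might be combined into one image-text alignment score. Everything in it is an assumption for illustration: the function `expanded_alignment_score`, the `score` callback (a stand-in for any VLM scorer returning a value in [0, 1]), the `weight` parameter, and the averaging scheme are hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import Callable, Sequence


def expanded_alignment_score(
    caption: str,
    entailments: Sequence[str],
    contradictions: Sequence[str],
    score: Callable[[str], float],
    weight: float = 0.5,
) -> float:
    """Combine alignment scores for a caption and its NLI expansions.

    `score` is a hypothetical stand-in for a VLM image-text scorer that
    returns a value in [0, 1] for one sentence against a fixed image.
    The weighting scheme is an assumption, not the paper's formula.
    """
    original = score(caption)
    # Entailed sentences should also hold for a matching image.
    entail = mean(score(s) for s in entailments) if entailments else original
    # Contradicted sentences should not hold, so use the complement.
    contra = (
        mean(1.0 - score(s) for s in contradictions)
        if contradictions
        else original
    )
    expanded = 0.5 * (entail + contra)
    # Blend the original caption score with the expanded evidence.
    return weight * original + (1.0 - weight) * expanded


# Toy scorer: a lookup table standing in for a real model (made-up values).
toy_scores = {
    "a dog chasing a cat": 0.6,
    "an animal is running": 0.9,
    "the cat is chasing the dog": 0.2,
}

result = expanded_alignment_score(
    caption="a dog chasing a cat",
    entailments=["an animal is running"],
    contradictions=["the cat is chasing the dog"],
    score=toy_scores.__getitem__,
)
print(round(result, 3))
```

In this toy setup, a high score on the entailment and a low score on the contradiction both raise the combined score above the score of the original caption alone, which is the intuition behind using both kinds of expansion.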

Videos

Natural Language Inference Improves Compositionality in Vision-Language Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Semantic Web and Ontologies