Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding

Daniil Ignatev; Ayman Santeer; Albert Gatt; Denis Paperno

arXiv:2511.17358 · cs.CL · November 24, 2025

Open Access

TL;DR

This paper introduces a zero-shot NLI approach that grounds language in visual representations: it generates images of premises with text-to-image models and compares them with textual hypotheses, achieving high accuracy and robustness to textual biases without task-specific fine-tuning.

Contribution

The paper presents a zero-shot NLI method that uses text-to-image models for visual grounding, and shows that the method resists textual biases and surface heuristics.

Findings

1. Achieves high accuracy without fine-tuning
2. Demonstrates robustness against textual biases
3. Validates approach with a controlled adversarial dataset

Abstract

We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.
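
The cosine-similarity variant of the inference step described in the abstract can be illustrated with a minimal sketch. It assumes the premise has already been rendered to an image by a text-to-image model, and that both the image and the hypothesis have been embedded into a shared vector space by some multimodal encoder; neither step is shown here. The function names, the three-way decision rule, and the threshold values are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)


def classify_nli(
    image_embedding: Sequence[float],
    hypothesis_embedding: Sequence[float],
    entail_threshold: float = 0.7,      # hypothetical value, not from the paper
    contradict_threshold: float = 0.3,  # hypothetical value, not from the paper
) -> str:
    """Map image-hypothesis similarity to an NLI label.

    ASSUMPTION: high similarity is read as entailment, low similarity as
    contradiction, and anything in between as neutral. The paper may use
    a different decision rule.
    """
    score = cosine_similarity(image_embedding, hypothesis_embedding)
    if score >= entail_threshold:
        return "entailment"
    if score <= contradict_threshold:
        return "contradiction"
    return "neutral"
```

In practice the two embeddings would come from a multimodal encoder applied to the generated premise image and to the hypothesis text, and the thresholds would need to be chosen without task-specific training to stay within the zero-shot setting the abstract describes.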

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis