Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
Daniil Ignatev, Ayman Santeer, Albert Gatt, Denis Paperno

TL;DR
This paper introduces a zero-shot NLI approach that grounds language in visual representations: it generates an image of the premise and compares that image with the textual hypothesis, achieving high accuracy and robustness to textual biases without task-specific training.
Contribution
The paper presents a zero-shot NLI method that grounds premises visually using text-to-image models, and shows that the method resists textual biases and surface heuristics.
Findings
Achieves high accuracy without fine-tuning
Demonstrates robustness against textual biases
Validates approach with a controlled adversarial dataset
Abstract
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.
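The cosine-similarity variant described in the abstract can be illustrated with a minimal sketch. It assumes the generated premise image and the textual hypothesis have already been embedded into a shared vector space by some multimodal encoder; the abstract does not name the encoder or the decision rule. The function names, the two-way label set, and the threshold value below are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def predict_label(
    premise_image_emb: Sequence[float],
    hypothesis_text_emb: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> str:
    """Label a premise/hypothesis pair from precomputed embeddings.

    premise_image_emb: embedding of the image generated from the premise.
    hypothesis_text_emb: embedding of the hypothesis text in the same space.
    """
    score = cosine_similarity(premise_image_emb, hypothesis_text_emb)
    return "entailment" if score >= threshold else "non-entailment"


# Toy vectors standing in for real encoder outputs.
print(predict_label([1.0, 0.0], [1.0, 0.0]))
print(predict_label([1.0, 0.0], [0.0, 1.0]))
```

In practice the threshold would have to be chosen without task-specific training to preserve the zero-shot setting; how the paper handles this, and how it separates neutral from contradiction, is not stated in the abstract.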
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
