Visually Grounded Generation of Entailments from Premises
Somaye Jafaritazehjani, Albert Gatt, Marc Tanti

TL;DR
This paper explores generating hypotheses from visual premises for natural language inference, demonstrating that multimodal models grounded in visual information can effectively produce entailments, with marginal improvements over unimodal models.
Contribution
It introduces a novel generation-based approach to NLI using visual grounding and compares multimodal and unimodal neural architectures for this task.
Findings
Multimodal models outperform unimodal models in entailment generation, though only marginally.
Automatic and human evaluation both indicate that entailments can be generated successfully.
Grounding textual premises in visual information benefits hypothesis generation.
Abstract
Natural Language Inference (NLI) is the task of determining the semantic relationship between a premise and a hypothesis. In this paper, we focus on the generation of hypotheses from premises in a multimodal setting, to generate a sentence (hypothesis) given an image and/or its description (premise) as the input. The main goals of this paper are (a) to investigate whether it is reasonable to frame NLI as a generation task; and (b) to consider the degree to which grounding textual premises in visual information is beneficial to generation. We compare different neural architectures, showing through automatic and human evaluation that entailments can indeed be generated successfully. We also show that multimodal models outperform unimodal models in this task, albeit marginally.
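The task framing in the abstract (a hypothesis generated from an image, its description, or both) can be illustrated with a small sketch. This is not the authors' implementation: the `Premise` class, the `input_mode` and `encoder_inputs` helpers, the mode labels, and the idea of a precomputed image feature vector are all hypothetical names and assumptions introduced here only to show how the unimodal and multimodal input settings differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Premise:
    """Hypothetical premise container: a description, an image vector, or both."""
    text: Optional[str] = None                         # textual description of the image
    image_features: Optional[Sequence[float]] = None   # assumed precomputed image vector


def input_mode(premise: Premise) -> str:
    """Report which signals a hypothesis generator could condition on."""
    has_text = premise.text is not None
    has_image = premise.image_features is not None
    if has_text and has_image:
        return "multimodal"
    if has_text:
        return "text-only"
    if has_image:
        return "image-only"
    raise ValueError("a premise needs text, an image, or both")


def encoder_inputs(premise: Premise) -> dict:
    """Collect the inputs a (hypothetical) generator would receive for this premise."""
    inputs = {"mode": input_mode(premise)}
    if premise.text is not None:
        # Naive whitespace tokenisation, used purely for illustration.
        inputs["tokens"] = premise.text.lower().split()
    if premise.image_features is not None:
        inputs["image"] = list(premise.image_features)
    return inputs
```

Under these assumptions, comparing unimodal and multimodal models amounts to training the same kind of generator on the different input dictionaries and comparing the hypotheses it produces.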
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
