TL;DR
This paper introduces CALeC, a method for visual entailment with natural language explanations that improves semantic alignment and explanation faithfulness by combining chunk-aware semantic alignment with lexical constraints.
Contribution
It proposes a unified framework with chunk-aware semantic alignment and lexical constraints to improve reasoning and explanation quality in visual entailment tasks.
Findings
CALeC outperforms existing models in inference accuracy.
It generates more faithful and informative explanations.
Experimental results on three datasets validate its effectiveness.
Abstract
Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and generate a sentence to explain the decision-making process. Previous methods rely mainly on a pre-trained vision-language model to perform the relation inference and a language model to generate the corresponding explanation. However, the pre-trained vision-language models mainly build token-level alignment between text and image yet ignore the high-level semantic alignment between the phrases (chunks) and visual contents, which is critical for vision-language reasoning. Moreover, the explanation generator based only on the encoded joint representation does not explicitly consider the critical decision-making points of relation inference. Thus the generated explanations are less faithful to visual-language reasoning. To mitigate these problems, we propose a unified…
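To make the token-level versus chunk-level distinction concrete, here is a minimal sketch in plain Python. It is not the CALeC implementation, and the abstract above does not specify the mechanism. It assumes that a chunk representation can be approximated by mean-pooling the embeddings of its tokens, and that alignment can be scored by cosine similarity against image-region embeddings. All names (`mean_pool`, `cosine`, `align_chunks`) and the toy 2-d vectors are hypothetical and chosen only for illustration.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_chunks(token_vecs, chunk_spans, region_vecs):
    """For each (start, end) token span, return (best_region_index, similarity).

    Assumption: a chunk is represented by the mean of its token vectors.
    """
    results = []
    for start, end in chunk_spans:
        chunk_vec = mean_pool(token_vecs[start:end])
        sims = [cosine(chunk_vec, region) for region in region_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        results.append((best, sims[best]))
    return results


# Toy example: "a brown dog on grass" with made-up 2-d embeddings.
token_vecs = [
    [0.9, 0.1],  # a
    [0.8, 0.2],  # brown
    [1.0, 0.0],  # dog
    [0.5, 0.5],  # on
    [0.0, 1.0],  # grass
]
chunk_spans = [(0, 3), (3, 5)]           # "a brown dog", "on grass"
region_vecs = [[0.0, 1.0], [1.0, 0.0]]   # region 0: grass-like, region 1: dog-like

alignments = align_chunks(token_vecs, chunk_spans, region_vecs)
# The first chunk aligns with region 1 and the second with region 0.
```

Scoring the pooled phrase "a brown dog" against regions, rather than scoring "a", "brown", and "dog" separately, is the kind of higher-level alignment the abstract argues token-level models miss. A real system would use learned encoders and a chunker instead of hand-written vectors and spans.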
