e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
Maxime Kayser, Oana-Maria Camburu, Leonard Salewski, Cornelius Emde, Virginie Do, Zeynep Akata, Thomas Lukasiewicz

TL;DR
e-ViL introduces a comprehensive benchmark and dataset for evaluating natural language explanations in vision-language tasks, enabling fair comparison of models and advancing explainability in multimodal AI.
Contribution
The paper provides the first unified evaluation framework for NLEs in VL tasks, a large dataset (e-SNLI-VE), and a new model combining UNITER and GPT-2 that outperforms previous methods.
Findings
e-ViL enables systematic comparison of NLE models.
e-SNLI-VE was, at the time of publication, the largest VL dataset with NLEs.
The proposed UNITER-GPT-2 model surpasses previous state-of-the-art results.
Abstract
Recently, there has been an increasing number of efforts to introduce models capable of generating natural language explanations (NLEs) for their predictions on vision-language (VL) tasks. Such models are appealing, because they can provide human-friendly and comprehensive explanations. However, there is a lack of comparison between existing methods, which is due to a lack of re-usable evaluation frameworks and a scarcity of datasets. In this work, we introduce e-ViL and e-SNLI-VE. e-ViL is a benchmark for explainable vision-language tasks that establishes a unified evaluation framework and provides the first comprehensive comparison of existing approaches that generate NLEs for VL tasks. It spans four models and three datasets and both automatic metrics and human evaluation are used to assess model-generated explanations. e-SNLI-VE is currently the largest existing VL dataset with NLEs…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · UNiversal Image-TExt Representation Learning · Cosine Annealing · Byte Pair Encoding · Dense Connections · Adam · Discriminative Fine-Tuning
