TL;DR
CLEVR-X is a large-scale dataset that extends CLEVR with natural language explanations for visual question answering, enabling better understanding and evaluation of explanation generation models.
Contribution
The paper introduces CLEVR-X, a new dataset with structured explanations for VQA, and provides baseline models and analysis for explanation generation.
Findings
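A user study confirms that the ground-truth explanations are complete and relevant.
Two state-of-the-art frameworks provide baseline results for explanation generation on CLEVR-X.
Using more ground-truth explanations improves the convergence of natural language generation (NLG) metrics.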
Abstract
Providing explanations in the context of Visual Question Answering (VQA) presents a fundamental problem in machine learning. To obtain detailed insights into the process of generating natural language explanations for VQA, we introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with natural language explanations. For each image-question pair in the CLEVR dataset, CLEVR-X contains multiple structured textual explanations which are derived from the original scene graphs. By construction, the CLEVR-X explanations are correct and describe the reasoning and visual information that is necessary to answer a given question. We conducted a user study to confirm that the ground-truth explanations in our proposed dataset are indeed complete and relevant. We present baseline results for generating natural language explanations in the context of VQA using two state-of-the-art…
