TL;DR
This paper highlights the limitations of current explainable question answering models and evaluation metrics, proposing a hierarchical model and new scores to better align with user needs and improve answer-explanation coupling.
Contribution
It introduces a hierarchical model and a regularization term to strengthen answer-explanation coupling in explainable QA systems, along with two evaluation scores to quantify that coupling.
Findings
The proposed models improve users' ability to judge whether the system is correct
Scores like F1 are not enough to estimate a model's usefulness in a practical setting
New scores better align with user experience
Abstract
Explainable question answering systems predict an answer together with an explanation showing why the answer has been selected. The goal is to enable users to assess the correctness of the system and understand its reasoning process. However, we show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation which might cause serious issues in user experience. As a remedy, we propose a hierarchical model and a new regularization term to strengthen the answer-explanation coupling as well as two evaluation scores to quantify the coupling. We conduct experiments on the HOTPOTQA benchmark data set and perform a user study. The user study shows that our models increase the ability of the users to judge the correctness of the system and that scores like F1 are not enough to estimate the usefulness of a model in a practical setting with human…
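As a rough illustration of why F1-style scores may say little about answer-explanation coupling, the sketch below scores the answer and the explanation independently, in the way HOTPOTQA-style evaluation commonly does (token-overlap F1 for answers, set-level F1 over supporting facts). The function names and the toy prediction are hypothetical and not taken from the paper, and the paper's own coupling scores are not reproduced here.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def set_f1(predicted: set, gold: set) -> float:
    """Set-level F1, e.g. over (title, sentence index) supporting facts."""
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy example: the explanation matches the gold facts exactly,
# yet the predicted answer is wrong.
gold_answer = "Paris"
gold_facts = {("France", 0), ("Paris", 2)}
pred_answer = "Lyon"
pred_facts = {("France", 0), ("Paris", 2)}

answer_f1 = token_f1(pred_answer, gold_answer)
explanation_f1 = set_f1(pred_facts, gold_facts)
print(f"answer F1 = {answer_f1:.2f}, explanation F1 = {explanation_f1:.2f}")
```

In this toy case the explanation score is perfect while the answer score is zero. Neither number indicates whether the explanation actually supports the predicted answer, which is the kind of gap the abstract describes.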
