On the Challenges of Evaluating Compositional Explanations in Multi-Hop Inference: Relevance, Completeness, and Expert Ratings

Peter Jansen; Kelly Smith; Dan Moreno; Huitzilin Ortiz

arXiv:2109.03334 · cs.CL · September 9, 2021

TL;DR

This paper shows that evaluating multi-hop explanations against one or a few gold explanations underestimates model performance, because models often produce valid explanations that differ from the gold ones. To improve assessment, the authors collect 126k relevance ratings from domain experts (science teachers).

Contribution

The authors create a large expert-annotated dataset and demonstrate that existing evaluation metrics significantly underestimate true explanation quality in multi-hop reasoning.

Findings

01 Expert-augmented ratings improve evaluation accuracy

02 Current automatic metrics underestimate performance by up to 36%

03 Models discover valid explanations beyond gold standards

Abstract

Building compositional explanations requires models to combine two or more facts that, together, describe why the answer to a question is correct. Typically, these "multi-hop" explanations are evaluated relative to one (or a small number of) gold explanations. In this work, we show these evaluations substantially underestimate model performance, both in terms of the relevance of included facts, as well as the completeness of model-generated explanations, because models regularly discover and produce valid explanations that are different than gold explanations. To address this, we construct a large corpus of 126k domain-expert (science teacher) relevance ratings that augment a corpus of explanations to standardized science exam questions, discovering 80k additional relevant facts not rated as gold. We build three strong models based on different methodologies (generation, ranking, and…
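
The gap the abstract describes can be illustrated with a toy calculation. The following is a minimal sketch, not the authors' evaluation code: the fact identifiers, the `precision` helper, and the numbers are all hypothetical, chosen only to show how scoring the same model output against a gold-only set versus an expert-augmented set of relevant facts changes the measured relevance.

```python
def precision(predicted, relevant):
    """Fraction of predicted facts that appear in the set of relevant facts."""
    if not predicted:
        return 0.0
    hits = sum(1 for fact in predicted if fact in relevant)
    return hits / len(predicted)


# Hypothetical fact IDs for one question (illustrative only, not from the corpus).
gold_facts = {"f1", "f2"}                   # facts in the single gold explanation
expert_relevant = gold_facts | {"f3", "f4"}  # gold facts plus extra facts experts rated relevant

# A hypothetical model explanation that takes a valid path different from gold.
predicted = ["f1", "f3", "f4", "f9"]

gold_only_score = precision(predicted, gold_facts)
augmented_score = precision(predicted, expert_relevant)

print(f"precision vs. gold only:        {gold_only_score:.2f}")
print(f"precision vs. expert-augmented: {augmented_score:.2f}")
```

In this toy case the same output scores 0.25 against the gold explanation alone and 0.75 once the additional expert-rated facts count as relevant, which is the kind of underestimate the abstract attributes to gold-only evaluation.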
