On the Challenges of Evaluating Compositional Explanations in Multi-Hop Inference: Relevance, Completeness, and Expert Ratings
Peter Jansen, Kelly Smith, Dan Moreno, Huitzilin Ortiz

TL;DR
This paper investigates the challenges of evaluating multi-hop explanations, shows that scoring against gold explanations underestimates model performance, and introduces a large expert-annotated dataset to improve assessment accuracy.
Contribution
The authors create a large expert-annotated dataset and demonstrate that existing evaluation metrics significantly underestimate true explanation quality in multi-hop reasoning.
Findings
Expert-augmented ratings improve evaluation accuracy
Current automatic metrics underestimate performance by up to 36%
Models discover valid explanations beyond gold standards
Abstract
Building compositional explanations requires models to combine two or more facts that, together, describe why the answer to a question is correct. Typically, these "multi-hop" explanations are evaluated relative to one (or a small number of) gold explanations. In this work, we show these evaluations substantially underestimate model performance, in terms of both the relevance of included facts and the completeness of model-generated explanations, because models regularly discover and produce valid explanations that are different from gold explanations. To address this, we construct a large corpus of 126k domain-expert (science teacher) relevance ratings that augment a corpus of explanations to standardized science exam questions, discovering 80k additional relevant facts not rated as gold. We build three strong models based on different methodologies (generation, ranking, and…
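The underestimation described in the abstract can be illustrated with a small sketch. This is not the authors' evaluation code: the fact identifiers, the `relevance_score` helper, and the toy ratings are all hypothetical, chosen only to show how scoring a model explanation against gold facts alone can differ from scoring it against a larger set of expert-rated relevant facts.

```python
def relevance_score(predicted_facts, relevant_facts):
    """Fraction of the predicted facts that appear in the relevant set.

    Hypothetical helper for illustration; the paper's actual metrics
    may be defined differently.
    """
    predicted = set(predicted_facts)
    if not predicted:
        return 0.0
    return len(predicted & set(relevant_facts)) / len(predicted)


# Toy data (assumed, not taken from the paper's corpus).
gold_facts = {"f1", "f2", "f3"}               # facts in the single gold explanation
expert_relevant = gold_facts | {"f4", "f5"}   # gold plus extra facts experts rated relevant
model_explanation = ["f1", "f2", "f4", "f5"]  # a valid explanation that differs from gold

gold_only = relevance_score(model_explanation, gold_facts)
expert_augmented = relevance_score(model_explanation, expert_relevant)

print(f"relevance vs. gold only:        {gold_only:.2f}")
print(f"relevance vs. expert-augmented: {expert_augmented:.2f}")
```

In this toy case the model's explanation uses two facts outside the gold explanation, so the gold-only score penalizes it even though every fact it includes was rated relevant by the (hypothetical) experts.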
