TL;DR
This paper proposes a framework and a set of automated metrics for evaluating LLM-generated narrative explanations of tabular classification models. It targets the need for objective assessment of XAI narratives without human surveys, and it uses the metrics to expose hallucination-related problems in those narratives.
Contribution
It introduces a novel framework and metrics for quantitatively evaluating LLM-generated XAI narratives without human surveys.
Findings
Automated metrics make it possible to compare several state-of-the-art LLMs on how well they generate explanation narratives.
The approach reveals challenges such as hallucinations in LLM explanations.
Metrics help identify differences across datasets and prompt types.
Abstract
A rapidly developing application of LLMs in XAI is to convert quantitative explanations such as SHAP into user-friendly narratives to explain the decisions made by smaller prediction models. Evaluating the narratives without relying on human preference studies or surveys is becoming increasingly important in this field. In this work we propose a framework and explore several automated metrics to evaluate LLM-generated narratives for explanations of tabular classification tasks. We apply our approach to compare several state-of-the-art LLMs across different datasets and prompt types. As a demonstration of their utility, these metrics allow us to identify new challenges related to LLM hallucinations for XAI narratives.
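The abstract does not specify the metrics themselves, so the following is only a minimal sketch of what an automated faithfulness check of this kind could look like. It assumes the SHAP values are available as a feature-to-value mapping and that the claims made in the narrative (which features matter, and in which direction) have already been extracted by some other step. The function names `sign_agreement` and `rank_agreement`, the claim format, and the example features are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def sign_agreement(shap_values: Dict[str, float],
                   narrative_claims: List[Tuple[str, int]]) -> float:
    """Fraction of narrative claims whose direction matches the SHAP sign.

    narrative_claims holds (feature, claimed_sign) pairs with claimed_sign
    in {+1, -1}. Assumption: these were extracted from the narrative earlier.
    A claim about a feature missing from shap_values counts as a mismatch,
    which serves as a crude proxy for a hallucinated feature.
    """
    if not narrative_claims:
        return 0.0
    correct = 0
    for feature, claimed_sign in narrative_claims:
        value = shap_values.get(feature)
        if value is not None and value * claimed_sign > 0:
            correct += 1
    return correct / len(narrative_claims)


def rank_agreement(shap_values: Dict[str, float],
                   narrative_order: List[str]) -> float:
    """Fraction of positions where the narrative's feature order matches
    the order obtained by sorting features on absolute SHAP value."""
    if not narrative_order:
        return 0.0
    true_order = sorted(shap_values, key=lambda f: abs(shap_values[f]),
                        reverse=True)
    matches = sum(1 for a, b in zip(narrative_order, true_order) if a == b)
    return matches / len(narrative_order)


if __name__ == "__main__":
    # Hypothetical SHAP output for a single prediction.
    shap_values = {"age": 0.30, "income": -0.20, "tenure": 0.05}
    # Hypothetical claims extracted from an LLM narrative.
    claims = [("age", 1), ("income", 1), ("zip_code", -1)]
    print(sign_agreement(shap_values, claims))
    print(rank_agreement(shap_values, ["age", "tenure"]))
```

In this toy case only the `age` claim has the right direction: `income` is given the wrong sign, and `zip_code` does not appear in the SHAP output at all, so a check of this style would flag it as a possible hallucination.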
Taxonomy
Methods: Shapley Additive Explanations
