Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs
Christos Fragkathoulas, Odysseas S. Chlapanis

TL;DR
This paper presents a new explainability method for large language models that identifies the parts of the input that are crucial for a correct answer and compares them with the model's self-explanations to assess faithfulness.
Contribution
It introduces an efficient local perturbation-based explainability technique and a faithfulness metric, validated on the Natural Questions dataset.
Findings
Effective in identifying necessary input components for correct answers
Accurately measures faithfulness by comparing crucial parts with self-explanations
Demonstrates improved interpretability of LLM decisions
Abstract
This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. LLMs often require additional context to answer certain questions correctly. To explain which parts of that context matter, we propose a new, efficient alternative explainability technique, inspired by the commonly used leave-one-out approach. Using this approach, we identify the sufficient and necessary parts for the LLM to generate correct answers, serving as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. Using the Natural Questions dataset, we validate our approach, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.
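The abstract gives no implementation details, so the following is only a minimal sketch of the plain leave-one-out baseline that the proposed technique is said to be inspired by. It is not the paper's efficient variant. The names `necessary_sentences`, `overlap_score`, `answer_fn`, `is_correct`, and `toy_answer` are hypothetical. The Jaccard overlap is an assumed stand-in for the faithfulness metric, whose actual definition is not given here. The context is assumed to be split into sentences, and a toy function replaces the black-box LLM call.

```python
from typing import Callable, List, Set


def necessary_sentences(
    question: str,
    context: List[str],
    answer_fn: Callable[[str, List[str]], str],
    is_correct: Callable[[str], bool],
) -> Set[int]:
    """Plain leave-one-out: return indices of context sentences whose
    removal turns a correct answer into an incorrect one."""
    # If the model fails even with the full context, there is nothing to explain.
    if not is_correct(answer_fn(question, context)):
        return set()
    needed: Set[int] = set()
    for i in range(len(context)):
        reduced = context[:i] + context[i + 1:]  # drop sentence i only
        if not is_correct(answer_fn(question, reduced)):
            needed.add(i)
    return needed


def overlap_score(perturbation: Set[int], self_explanation: Set[int]) -> float:
    """Jaccard overlap between the two explanations.
    Assumption: a stand-in, not the metric defined in the paper."""
    if not perturbation and not self_explanation:
        return 1.0
    union = perturbation | self_explanation
    return len(perturbation & self_explanation) / len(union)


def toy_answer(question: str, context: List[str]) -> str:
    """Hypothetical stand-in for a black-box LLM call."""
    if any("capital of France is Paris" in s for s in context):
        return "Paris"
    return "unknown"


context = [
    "France is in Europe.",
    "The capital of France is Paris.",
    "Paris hosts the Louvre.",
]
needed = necessary_sentences(
    "What is the capital of France?",
    context,
    toy_answer,
    lambda answer: answer == "Paris",
)
# Suppose the model's self-explanation cites sentences 1 and 2.
score = overlap_score(needed, {1, 2})
print(needed, score)
```

In this toy setup, only the second sentence is necessary, so a self-explanation that also cites the third sentence overlaps only partially with the perturbation-based explanation. Plain leave-one-out costs one model call per context sentence, which is presumably the expense the paper's efficient alternative aims to reduce.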
