Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs

Christos Fragkathoulas; Odysseas S. Chlapanis

arXiv:2409.13764·cs.CL·September 24, 2024

TL;DR

This paper presents a new explainability method for large language models that identifies crucial input parts for correct answers and compares them with the model's self-explanations to assess faithfulness.

Contribution

It introduces an efficient local perturbation-based explainability technique and a faithfulness metric, validated on the Natural Questions dataset.

Findings

01 Effective in identifying necessary input components for correct answers

02 Accurately measures faithfulness by comparing crucial parts with self-explanations

03 Demonstrates improved interpretability of LLM decisions

Abstract

This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. LLMs often require additional context to answer certain questions correctly. To identify which parts of that context matter, we propose a new explainability technique, an efficient alternative inspired by the commonly used leave-one-out approach. Using this approach, we identify the parts of the context that are sufficient and necessary for the LLM to generate correct answers, and these parts serve as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. Using the Natural Questions dataset, we validate our approach, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.
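
To make the idea concrete, here is a minimal sketch of the plain leave-one-out baseline that the abstract cites as inspiration, followed by a simple comparison against a self-explanation. It is not the authors' efficient technique or their metric, neither of which is specified on this page. The names `leave_one_out`, `overlap_score`, and `toy_answer_fn` are hypothetical, the toy function stands in for a black-box LLM call, and Jaccard overlap is an assumed stand-in for the faithfulness metric.

```python
from typing import Callable, List, Set


def leave_one_out(
    question: str,
    context: List[str],
    answer_fn: Callable[[str, List[str]], str],
    gold: str,
) -> Set[int]:
    """Return indices of context sentences whose removal breaks the answer."""
    necessary: Set[int] = set()
    for i in range(len(context)):
        # Perturb the input by dropping exactly one sentence.
        reduced = context[:i] + context[i + 1:]
        if answer_fn(question, reduced) != gold:
            necessary.add(i)
    return necessary


def overlap_score(found: Set[int], self_explained: Set[int]) -> float:
    """Jaccard overlap between two sets of sentence indices (assumed metric)."""
    if not found and not self_explained:
        return 1.0
    return len(found & self_explained) / len(found | self_explained)


def toy_answer_fn(question: str, context: List[str]) -> str:
    """Hypothetical stand-in for a black-box LLM: needs one specific sentence."""
    for sentence in context:
        if "capital of France" in sentence:
            return "Paris"
    return "unknown"


context = [
    "France is in Europe.",
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
]
necessary = leave_one_out(
    "What is the capital of France?", context, toy_answer_fn, gold="Paris"
)

# Assume the model's self-explanation pointed at sentences 1 and 2.
self_explanation = {1, 2}
score = overlap_score(necessary, self_explanation)
print(necessary, score)
```

In this toy setup only the second sentence is necessary, so a self-explanation that also cites the third sentence overlaps only partially with the perturbation-based explanation. Plain leave-one-out needs one model call per sentence, which is the cost an efficient alternative would aim to reduce.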
