Evaluating Readability and Faithfulness of Concept-based Explanations
Meng Li, Haoran Jin, Ruixuan Huang, Zhihao Xu, Defu Lian, Zijia Lin, Di Zhang, Xiting Wang

TL;DR
This paper proposes a formal framework for evaluating the faithfulness and readability of concept-based explanations in large language models, addressing the challenges posed by their non-local nature and high-dimensional representation.
Contribution
It introduces a unified formalization of concepts, a perturbation-based faithfulness measure, and an automatic readability metric, along with a meta-evaluation method for explanation assessment.
Findings
Quantifies faithfulness through optimized perturbations in high-dimensional space.
Provides an automatic measure for readability based on pattern coherence.
Conducts extensive experiments to guide evaluation measure selection.
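The idea of measuring faithfulness by perturbation can be illustrated with a minimal sketch. This is not the paper's method: the summary describes optimized perturbations, whereas the example below swaps in a much simpler projection ablation, and every name in it (`ablate`, `faithfulness`, the toy `readout` matrix) is a hypothetical chosen for illustration. It assumes a concept is a single direction in hidden space and that the model output is a softmax over a linear readout.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ablate(hidden, direction):
    # Remove the component of each hidden vector along `direction`.
    u = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ u, u)

def faithfulness(hidden, direction, readout):
    # Mean KL divergence between outputs before and after ablation.
    # Assumption: a larger output change means the concept matters more.
    p = softmax(hidden @ readout.T)
    q = softmax(ablate(hidden, direction) @ readout.T)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# Toy setup (hypothetical): 32 hidden states of size 8, and a 3-class
# readout that only looks at hidden dimension 0.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(32, 8))
readout = np.zeros((3, 8))
readout[0, 0] = 4.0

used = np.eye(8)[0]    # direction the readout depends on
unused = np.eye(8)[1]  # direction the readout ignores

score_used = faithfulness(hidden, used, readout)
score_unused = faithfulness(hidden, unused, readout)
```

In this toy setting, ablating the direction the readout depends on changes the output distribution, while ablating an ignored direction leaves it unchanged, so `score_used` is positive and `score_unused` is zero. A real evaluation would perturb the hidden states of an actual LLM and would need a perturbation strength that is comparable across concepts, which is the issue the optimized perturbation in the paper is meant to address.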
Abstract
With the growing popularity of general-purpose Large Language Models (LLMs) comes a need for more global explanations of model behaviors. Concept-based explanations arise as a promising avenue for explaining high-level patterns learned by LLMs. Yet their evaluation poses unique challenges, especially due to their non-local nature and high-dimensional representation in a model's hidden space. Current methods approach concepts from different perspectives, lacking a unified formalization. This makes evaluating the core measures of concepts, namely faithfulness or readability, challenging. To bridge the gap, we introduce a formal definition of concepts that generalizes to the settings of diverse concept-based explanations. Based on this, we quantify the faithfulness of a concept explanation via perturbation. We ensure adequate perturbation in the high-dimensional space for different concepts via an…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
