TL;DR
This paper introduces a metric for the stability of explanations in AI models. It assesses whether attribution patterns remain consistent across similar inputs, giving a quantitative basis for judging how far those explanations can be trusted.
Contribution
It proposes a novel explanation consistency metric using cosine similarity of SHAP values and demonstrates its effectiveness on transformer-based sentiment analysis models.
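As a rough illustration of the idea, the sketch below scores a group of attribution vectors by their mean pairwise cosine similarity. This is only one plausible reading of the contribution, not the paper's exact definition. It assumes the SHAP values for each related input (for example, an original sentence and its label-preserving variants) have already been computed and aligned to vectors of equal length. The function names `cosine_similarity` and `explanation_consistency` and the toy attribution values are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity of two equal-length attribution vectors.

    Returns 0.0 when either vector is (numerically) all zeros, which is
    an assumption made here to avoid division by zero.
    """
    if len(a) != len(b):
        raise ValueError("attribution vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a * norm_b < eps:
        return 0.0
    return dot / (norm_a * norm_b)


def explanation_consistency(attributions):
    """Mean pairwise cosine similarity over a group of attribution vectors.

    Values near 1 suggest stable attribution patterns across the group;
    values near 0 or below suggest the explanations disagree.
    """
    if len(attributions) < 2:
        raise ValueError("need at least two attribution vectors")
    sims = [cosine_similarity(a, b) for a, b in combinations(attributions, 2)]
    return sum(sims) / len(sims)


# Toy attribution vectors (made-up numbers, not results from the paper).
original = [0.9, 0.05, -0.4]
paraphrase = [0.8, 0.10, -0.5]   # similar attribution pattern
flipped = [-0.7, 0.2, 0.6]       # attribution pattern reversed

stable_score = explanation_consistency([original, paraphrase])
unstable_score = explanation_consistency([original, flipped])
print(f"stable: {stable_score:.3f}, unstable: {unstable_score:.3f}")
```

In practice the vectors would come from a SHAP explainer applied to the model. Token-level attributions for sentences of different lengths would first need some alignment step, which this sketch leaves out.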
Findings
The metric can identify inconsistent model explanations effectively.
Experiments show the metric detects deviations from intended behavior.
The approach enhances understanding of model rationale stability.
Abstract
Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance-centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric for assessing the consistency of model explanations, checking that models consistently reflect the intended objectives and that their explanations stay consistent under label-preserving perturbations. We implement this metric using a pre-trained BERT model on the SST-2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for…
