Benchmarking XAI Explanations with Human-Aligned Evaluations
Rémi Kazmierczak, Steve Azzolin, Eloïse Berthier, Anna Hedström, Patricia Delhomme, David Filliat, Nicolas Bousquet, Goran Frehse, Massimiliano Mancini, Baptiste Caramiaux, Andrea Passerini, Gianni Franchi

TL;DR
This paper presents PASTA, a human-centric framework and dataset for evaluating XAI methods in computer vision, along with an automated scoring system that predicts human preferences and enables cross-modality comparison.
Contribution
The paper introduces PASTA, a large-scale benchmark dataset and an automated scoring method for human-aligned evaluation of XAI techniques in computer vision.
Findings
PASTA-dataset enables robust comparison of XAI methods
PASTA-score predicts human preferences accurately
Benchmark supports cross-modality explanation evaluation
Abstract
We introduce PASTA (Perceptual Assessment System for explanaTion of Artificial Intelligence), a novel human-centric framework for evaluating eXplainable AI (XAI) techniques in computer vision. Our first contribution is the creation of the PASTA-dataset, the first large-scale benchmark that spans a diverse set of models and both saliency-based and concept-based explanation methods. This dataset enables robust, comparative analysis of XAI techniques based on human judgment. Our second contribution is an automated, data-driven benchmark that predicts human preferences using the PASTA-dataset. This scoring method, called PASTA-score, offers scalable, reliable, and consistent evaluation aligned with human perception. Additionally, our benchmark allows for comparisons between explanations across different modalities, an aspect previously unaddressed. We then propose to apply our scoring method…
Peer Reviews
Decision · Submitted to ICLR 2025
1. The paper addresses an important problem of benchmarking XAI methods in a comprehensive and efficient way, aligning evaluations with human assessments, which I believe is an important gap in the literature. 2. The proposed benchmark allows for the comparison of XAI methods across different modalities, facilitating evaluations of both explanation-based and saliency-based methods. 3. The paper introduces a data-driven metric to mimic human assessments in evaluating the interpretability of exp
1. How many human subjects were involved in the benchmark creation? Is it five? This information is not explicitly stated in the main paper. Since the benchmark aims to align with human evaluations, it would be valuable to provide details about the annotators, including their ethnicity, gender, age, and other relevant demographics. Ideally, involving annotators from diverse backgrounds—such as varying ages, genders, and ethnicities—would help reduce potential biases in the benchmark. 2. The eva
1. The paper tries to answer an important question about the evaluation of XAI methods. 2. The overall evaluation protocol and the research questions, designed around fidelity, complexity, objectivity, and robustness, are presented appropriately.
1. First and foremost, to perform any sort of evaluation, the authors should consider the methodology and the architecture used, especially for gradient-based methods. E.g., GradCAM does not work well with transformers; it works significantly better on CNNs due to their architectural composition (see Chefer et al., "Transformer Interpretability Beyond Attention Visualization"). 2. One of the major limitations of this work is the use of another network such as CLIP as explanations ev
(1) This paper tackles an important challenge in XAI: the alignment between model explanations and human evaluations. (2) It carries out a human study with a broad range of explanation methods and multiple datasets, which can facilitate the development of future XAI methods. (3) With an automatic model for estimating human ratings, the study can be extended to help improve the trustworthiness of deep networks. (4) Extensive experiments are provided in the supplementary materials, providing the
(1) The paper emphasizes the limitations of existing studies in scaling to broader domains, due to difficulties of data collection and the subjectivity of human evaluation. Nevertheless, these problems are also well addressed in the proposed method. Specifically, the paper centers around a user study for rating model explanations, which also requires significant amounts of manual labor and does not alleviate annotator biases. While a quantitative evaluation model is proposed, with only the resul
Taxonomy
Topics: Scientific Computing and Data Management · Semantic Web and Ontologies · Advanced Database Systems and Queries
