Benchmarking XAI Explanations with Human-Aligned Evaluations

Rémi Kazmierczak; Steve Azzolin; Eloïse Berthier; Anna Hedström; Patricia Delhomme; David Filliat; Nicolas Bousquet; Goran Frehse; Massimiliano Mancini; Baptiste Caramiaux; Andrea Passerini; Gianni Franchi

arXiv:2411.02470·cs.CV·August 27, 2025·2 cites

Open Access · 3 Reviews

TL;DR

This paper presents PASTA, a human-centric framework and dataset for evaluating XAI methods in computer vision, along with an automated scoring system that predicts human preferences and enables cross-modality comparison.

Contribution

The paper introduces PASTA, a large-scale benchmark dataset and an automated scoring method for human-aligned evaluation of XAI techniques in computer vision.

Findings

01

PASTA-dataset enables robust comparison of XAI methods

02

PASTA-score predicts human preferences accurately

03

Benchmark supports cross-modality explanation evaluation

Abstract

We introduce PASTA (Perceptual Assessment System for explanaTion of Artificial Intelligence), a novel human-centric framework for evaluating eXplainable AI (XAI) techniques in computer vision. Our first contribution is the creation of the PASTA-dataset, the first large-scale benchmark that spans a diverse set of models and both saliency-based and concept-based explanation methods. This dataset enables robust, comparative analysis of XAI techniques based on human judgment. Our second contribution is an automated, data-driven benchmark that predicts human preferences using the PASTA-dataset. This scoring method, called PASTA-score, offers scalable, reliable, and consistent evaluation aligned with human perception. Additionally, our benchmark allows for comparisons between explanations across different modalities, an aspect previously unaddressed. We then propose to apply our scoring method…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper addresses an important problem of benchmarking XAI methods in a comprehensive and efficient way, aligning evaluations with human assessments, which I believe is an important gap in the literature. 2. The proposed benchmark allows for the comparison of XAI methods across different modalities, facilitating evaluations of both explanation-based and saliency-based methods. 3. The paper introduces a data-driven metric to mimic human assessments in evaluating the interpretability of exp

Weaknesses

1. How many human subjects were involved in the benchmark creation? Is it five? This information is not explicitly stated in the main paper. Since the benchmark aims to align with human evaluations, it would be valuable to provide details about the annotators, including their ethnicity, gender, age, and other relevant demographics. Ideally, involving annotators from diverse backgrounds—such as varying ages, genders, and ethnicities—would help reduce potential biases in the benchmark. 2. The eva

Reviewer 02 · Rating 5 · Confidence 2

Strengths

1. The paper tries to answer an important question about the evaluation of XAI methods. 2. The overall evaluation protocol and the research questions, designed around fidelity, complexity, objectivity, and robustness, are presented appropriately.

Weaknesses

1. First and foremost, to perform any sort of evaluation, the authors should consider the methodology and the architecture used, especially for gradient-based methods. E.g., GradCAM does not work well with transformers; rather, it is a method that works significantly better on CNNs due to their architectural composition (Chefer et al., Transformer Interpretability Beyond Attention Visualization). 2. One of the major limitations of this work is the use of another network such as CLIP as explanations ev

Reviewer 03 · Rating 6 · Confidence 4

Strengths

(1) This paper tackles an important challenge in XAI: the alignment between model explanations and human evaluations. (2) It carries out human study with a broad range of explanation methods and multiple datasets, which can facilitate the development of future XAI methods. (3) With an automatic model for estimating human ratings, the study can be extended to help improve the trustworthiness of deep networks. (4) Extensive experiments are provided in the supplementary materials, providing the

Weaknesses

(1) The paper emphasizes the limitations of existing studies in scaling to broader domains, due to difficulties of data collection and the subjectivity of human evaluation. Nevertheless, these problems are also not well addressed in the proposed method. Specifically, the paper centers around a user study for rating model explanations, which also requires significant amounts of manual labor and does not alleviate annotator biases. While a quantitative evaluation model is proposed, with only the resul

Taxonomy

Topics: Scientific Computing and Data Management · Semantic Web and Ontologies · Advanced Database Systems and Queries