B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data

Magdalena Proszewska; Tomasz Danel; Dawid Rymarczyk

arXiv:2505.22252·cs.LG·May 29, 2025

TL;DR

This paper introduces B-XAIC, a benchmark dataset based on real-world molecular data, to evaluate and improve the faithfulness of explainable AI methods for graph neural networks in cheminformatics.

Contribution

The paper presents B-XAIC, a new benchmark dataset with ground-truth rationales for molecular data, enabling more realistic evaluation of XAI methods for GNNs.

Findings

01. Existing XAI methods show limitations in faithfulness on B-XAIC.

02. B-XAIC provides a more realistic evaluation environment for GNN explanations.

03. The benchmark helps identify gaps in current explainability techniques.

Abstract

Understanding the reasoning behind deep learning model predictions is crucial in cheminformatics and drug discovery, where the design of a molecule determines its properties. However, current evaluation frameworks for Explainable AI (XAI) in this domain often rely on artificial datasets or simplified tasks, employing data-derived metrics that fail to capture the complexity of real-world scenarios and lack a direct link to explanation faithfulness. To address this, we introduce B-XAIC, a novel benchmark constructed from real-world molecular data and diverse tasks with known ground-truth rationales for assigned labels. Through a comprehensive evaluation using B-XAIC, we reveal limitations of existing XAI methods for Graph Neural Networks (GNNs) in the molecular domain. This benchmark provides a valuable resource for gaining deeper insights into the faithfulness of XAI, facilitating the…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1) **Real-world scale with ground-truth rationales**
   - ~50k molecules across 7 chemically meaningful tasks, each with node/edge-level rationales. This moves beyond synthetic motifs and enables fine-grained, objective scoring of explanations.
2) **Two-regime evaluation that targets common XAI failure modes**
   - Separates **Null Explanations (NE)** (nothing should be highlighted) from **Subgraph Explanations (SE)** (specific atoms/bonds matter), encouraging both *specificity* (

Weaknesses

1) **F1 definition is ambiguous**
   - The paper reports “F1” but does not specify **micro vs. macro** (or weighted) averaging.
   - **Why it matters:** Micro-F1 can be dominated by frequent classes, while Macro-F1 reflects per-class balance; conclusions about model/explainer performance can flip depending on this choice.
   - **Fix:** Report both Micro- and Macro-F1 (plus class-wise F1), or clearly state and justify the chosen averaging scheme.
2) **Missing presentation of extracted subgraphs*
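The micro vs. macro distinction raised in Reviewer 01's first weakness can be illustrated with a small sketch. This is a toy, single-label multi-class example written for illustration only; the helper name `f1_scores` and the data are assumptions and do not come from the paper or its repository, and the paper's actual averaging scheme is not stated on this page.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions.

    Hypothetical helper for illustration; not taken from B-XAIC.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy imbalanced data: nine samples of class 0, one of class 1,
# and a predictor that always outputs class 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

On this toy input the pooled (micro) score stays high because the frequent class dominates, while the per-class average (macro) is pulled down by the class that is never predicted, which is the effect the reviewer describes.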

Reviewer 02 · Rating 2 · Confidence 5

Strengths

Explainability is a difficult problem and comparing different explainability methods is even harder. Benchmarks that may offer an unbiased comparison can be useful. The benchmark is a useful characterization of the common tasks in the chemical space.

Weaknesses

There are a number of recent explainability methods that the paper could have included. For example, those in Fig 1 from https://arxiv.org/pdf/2306.01958 or from Fig 1 in https://arxiv.org/pdf/2310.01794. These methods include GSAT, DIR, SubgraphX, VGIB, GIB and others. Also, this work on substructure masking: https://www.nature.com/articles/s41467-023-38192-3. The paper discusses factual as well as counterfactual explainers. It is unclear if the proposed benchmarks will work for counterfactual

Reviewer 03 · Rating 2 · Confidence 5

Strengths

The distinction between the subgraph explanations and the null explanations adds an interesting nuance to xAI evaluations. The authors correctly point out that sometimes it is equally important to quantify if the correct explanation has been found as it is to make sure that no incorrect explanations are generated if they are not warranted. The paper presents a comprehensive empirical evaluation of various common explainers as well as various common graph neural network architectures.
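The two regimes described here can be sketched as a scoring routine over per-atom attributions. This is a minimal illustration under stated assumptions: the function name `score_explanation`, the fixed threshold, and the returned fields are all hypothetical, and the metrics the paper actually uses are not given on this page.

```python
def score_explanation(attributions, ground_truth, threshold=0.5):
    """Score per-atom attributions against a ground-truth 0/1 atom mask.

    Assumption (not from the paper): an atom counts as "highlighted"
    when its attribution is at or above `threshold`.
    An all-zero ground-truth mask is treated as a null explanation.
    """
    highlighted = [a >= threshold for a in attributions]
    relevant = [bool(g) for g in ground_truth]

    if not any(relevant):
        # Null-explanation regime: nothing should be highlighted.
        return {"regime": "NE", "passed": not any(highlighted)}

    # Subgraph-explanation regime: compare highlighted atoms to the mask.
    tp = sum(h and r for h, r in zip(highlighted, relevant))
    precision = tp / sum(highlighted) if any(highlighted) else 0.0
    recall = tp / sum(relevant)
    return {"regime": "SE", "precision": precision, "recall": recall}


# No atom matters and nothing is highlighted.
ne_result = score_explanation([0.1, 0.2, 0.05], [0, 0, 0])
# Two atoms matter; the explainer highlights both plus one extra atom.
se_result = score_explanation([0.9, 0.8, 0.1, 0.7], [1, 1, 0, 0])
print(ne_result)
print(se_result)
```

Treating the two cases separately means an explainer cannot score well simply by always highlighting something: it is penalized in the null regime for any highlighted atom and in the subgraph regime for missing or spurious atoms.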

Weaknesses

- The same general idea was presented 5 years ago by Sanchez-Lengeling in their work on "Evaluating Attribution for Graph Neural Networks". In their work, Sanchez-Lengeling et al. also propose various subgraph detection-based molecular classification tasks with the explicit goal of benchmarking graph explainability methods regarding the known subgraph masks. In addition, they also take the possibility of a regression task into consideration.
- Since this work presents essentially the same schem

Code & Models

Repositories

mproszewska/B-XAIC
PyTorch · Official

Taxonomy

Topics: Advanced Graph Neural Networks · Machine Learning in Materials Science · Computational Drug Discovery Methods