B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data
Magdalena Proszewska, Tomasz Danel, Dawid Rymarczyk

TL;DR
This paper introduces B-XAIC, a benchmark dataset based on real-world molecular data, to evaluate and improve the faithfulness of explainable AI methods for graph neural networks in cheminformatics.
Contribution
The paper presents B-XAIC, a new benchmark dataset with ground-truth rationales for molecular data, enabling more realistic evaluation of XAI methods for GNNs.
Findings
Existing XAI methods show limitations in faithfulness on B-XAIC.
B-XAIC provides a more realistic evaluation environment for GNN explanations.
The benchmark helps identify gaps in current explainability techniques.
Abstract
Understanding the reasoning behind deep learning model predictions is crucial in cheminformatics and drug discovery, where a molecule's design determines its properties. However, current evaluation frameworks for Explainable AI (XAI) in this domain often rely on artificial datasets or simplified tasks, employing data-derived metrics that fail to capture the complexity of real-world scenarios and lack a direct link to explanation faithfulness. To address this, we introduce B-XAIC, a novel benchmark constructed from real-world molecular data and diverse tasks with known ground-truth rationales for assigned labels. Through a comprehensive evaluation using B-XAIC, we reveal limitations of existing XAI methods for Graph Neural Networks (GNNs) in the molecular domain. This benchmark provides a valuable resource for gaining deeper insights into the faithfulness of XAI, facilitating the…
Peer Reviews
Decision·Submitted to ICLR 2026
## Strengths
1) **Real-world scale with ground-truth rationales**
   - ~50k molecules across 7 chemically meaningful tasks, each with node/edge-level rationales. This moves beyond synthetic motifs and enables fine-grained, objective scoring of explanations.
2) **Two-regime evaluation that targets common XAI failure modes**
   - Separates **Null Explanations (NE)** (nothing should be highlighted) from **Subgraph Explanations (SE)** (specific atoms/bonds matter), encouraging both *specificity* (
1) **F1 definition is ambiguous**
   - The paper reports “F1” but does not specify **micro vs. macro** (or weighted) averaging.
   - **Why it matters:** Micro-F1 can be dominated by frequent classes, while Macro-F1 reflects per-class balance; conclusions about model/explainer performance can flip depending on this choice.
   - **Fix:** Report both Micro- and Macro-F1 (plus class-wise F1), or clearly state and justify the chosen averaging scheme.
2) **Missing presentation of extracted subgraphs*
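To make the micro versus macro concern concrete, here is a minimal sketch in plain Python. The helper `f1_scores` and the toy labels are illustrative assumptions, not taken from the paper or its code; the example only shows how the two averaging schemes can diverge on an imbalanced label set.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1, per_class_f1) for single-label predictions.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    # Macro: unweighted mean of the per-class F1 values.
    macro = sum(per_class.values()) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro, per_class


# Toy imbalanced case (made-up data): 9 negatives, 1 positive,
# and a classifier that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
micro, macro, per_class = f1_scores(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f} per_class={per_class}")
```

On this toy input the pooled (micro) score stays high because the majority class dominates the counts, while the macro score is pulled down by the class that is never predicted. This is the kind of gap the reviewer asks the authors to disambiguate.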
Explainability is a difficult problem and comparing different explainability methods is even harder. Benchmarks that may offer an unbiased comparison can be useful. The benchmark is a useful characterization of the common tasks in the chemical space.
There are a number of recent explainability methods that the paper could have included. For example, those in Fig 1 from https://arxiv.org/pdf/2306.01958 or from Fig 1 in https://arxiv.org/pdf/2310.01794. These methods include GSAT, DIR, SubgraphX, VGIB, GIB and others. Also, this work on substructure masking: https://www.nature.com/articles/s41467-023-38192-3. The paper discusses factual as well as counterfactual explainers. It is unclear if the proposed benchmarks will work for counterfactual
The distinction between the subgraph explanations and the null explanations adds an interesting nuance to xAI evaluations. The authors correctly point out that sometimes it is equally important to quantify if the correct explanation has been found as it is to make sure that no incorrect explanations are generated if they are not warranted. The paper presents a comprehensive empirical evaluation of various common explainers as well as various common graph neural network architectures.
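The two regimes can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the function `score_explanation`, the 0.5 threshold, and the two scores (fraction of atoms left unhighlighted for the null case, F1 overlap for the subgraph case) are not the metrics defined by the benchmark, only one plausible way to express "nothing should be highlighted" versus "specific atoms matter".

```python
def score_explanation(attributions, rationale_mask, threshold=0.5):
    """Score per-atom attributions against a binary ground-truth rationale.

    Hypothetical scoring rule for illustration only:
    - empty rationale (null regime): fraction of atoms NOT highlighted
    - non-empty rationale (subgraph regime): F1 overlap with the rationale
    """
    if len(attributions) != len(rationale_mask) or not attributions:
        raise ValueError("attributions and rationale_mask must be non-empty and equal length")

    highlighted = [a >= threshold for a in attributions]
    mask = [bool(m) for m in rationale_mask]

    if not any(mask):
        # Null regime: every highlighted atom is a spurious explanation.
        score = 1.0 - sum(highlighted) / len(highlighted)
        return {"regime": "NE", "score": score}

    # Subgraph regime: compare highlighted atoms with the rationale atoms.
    tp = sum(h and m for h, m in zip(highlighted, mask))
    fp = sum(h and not m for h, m in zip(highlighted, mask))
    fn = sum(m and not h for h, m in zip(highlighted, mask))
    denom = 2 * tp + fp + fn
    return {"regime": "SE", "score": 2 * tp / denom if denom else 0.0}


# Made-up attributions for two four-atom molecules.
se_result = score_explanation([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
ne_result = score_explanation([0.9, 0.1, 0.1, 0.1], [0, 0, 0, 0])
print(se_result, ne_result)
```

Keeping the two cases separate means an explainer that highlights something on every input is penalized in the null regime even if it does well in the subgraph regime.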
- The same general idea was presented 5 years ago by Sanchez-Lengeling et al. in their work on "Evaluating Attribution for Graph Neural Networks". In their work, Sanchez-Lengeling et al. also propose various subgraph detection-based molecular classification tasks with the explicit goal of benchmarking graph explainability methods regarding the known subgraph masks. In addition, they also take the possibility of a regression task into consideration.
- Since this work presents essentially the same schem
Taxonomy
Topics: Advanced Graph Neural Networks · Machine Learning in Materials Science · Computational Drug Discovery Methods
