A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions
Peiyu Yang, Naveed Akhtar, Jiantong Jiang, Ajmal Mian

TL;DR
This paper introduces a novel backdoor-based benchmark for evaluating attribution methods in explainable AI, addressing the challenge of assessing faithfulness without ground truth and providing a standardized, fair evaluation framework.
Contribution
It proposes a new benchmark (BackX) that meets fidelity criteria, establishes its theoretical advantages, and offers a standardized setup for fair comparison of attribution methods.
Findings
BackX outperforms existing benchmarks in attribution evaluation.
The standardized setup reduces confounding factors in benchmarking.
The analysis also yields insights into neural Trojan defense using attribution methods.
Abstract
Attribution methods compute importance scores for input features to explain model predictions. However, assessing the faithfulness of these methods remains challenging due to the absence of attribution ground truth to model predictions. In this work, we first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill, thereby facilitating a systematic assessment of attribution benchmarks. Next, we introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria. We theoretically establish the superiority of our approach over the existing benchmarks for well-founded attribution evaluation. With extensive analysis, we further establish a standardized evaluation setup that mitigates confounding factors such as post-processing techniques and explained predictions, thereby ensuring a fair and…
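The idea of using a backdoor trigger as a stand-in for attribution ground truth can be illustrated with a small sketch. This is not the BackX metric, which the truncated abstract does not specify. It is a hypothetical check, assuming the trigger's pixel locations are known, that measures how many of the top-k attributed pixels fall inside the trigger region. The function name `trigger_hit_rate`, the toy attribution map, and the mask are all assumptions made for illustration.

```python
from typing import Sequence


def trigger_hit_rate(attribution: Sequence[Sequence[float]],
                     trigger_mask: Sequence[Sequence[bool]],
                     k: int) -> float:
    """Fraction of the k highest-|attribution| pixels inside the trigger mask.

    Hypothetical illustration only; not the metric defined by the paper.
    """
    # Flatten both grids into (absolute score, inside_trigger) pairs.
    pairs = [
        (abs(score), bool(inside))
        for attr_row, mask_row in zip(attribution, trigger_mask)
        for score, inside in zip(attr_row, mask_row)
    ]
    if not 0 < k <= len(pairs):
        raise ValueError("k must be between 1 and the number of pixels")
    # Sort by descending score and keep the k strongest pixels.
    top_k = sorted(pairs, key=lambda p: p[0], reverse=True)[:k]
    return sum(inside for _, inside in top_k) / k


# Toy 3x3 attribution map with an assumed two-pixel trigger patch
# in the right-hand column (rows 1 and 2).
attribution = [
    [0.0, 0.1, 0.0],
    [0.2, 0.0, 0.9],
    [0.0, 0.3, 0.8],
]
trigger_mask = [
    [False, False, False],
    [False, False, True],
    [False, False, True],
]
score = trigger_hit_rate(attribution, trigger_mask, k=2)
```

Here both of the two strongest pixels lie inside the assumed trigger, so `score` is 1.0; with `k=3` the third pixel falls outside and the rate drops to 2/3. Taking the absolute value of the attribution is itself a post-processing choice, which is the kind of confounding factor the abstract says a standardized setup should control.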
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The four fidelity criteria laid out in Section 2.1 are a valuable contribution. They provide a clear language and a useful lens through which to analyze and critique *all* XAI evaluation methods, not just this one. 2. The authors have tested nine different attribution methods across three datasets (including the large-scale ImageNet) and (in the appendix) against five different attack types, including non-trivial ones like WaNet's geometric warping. 3. Section 4 is perhaps the paper's st
1. My main reservation is that the core idea feels more like a significant incremental improvement than a new paradigm. The use of backdoor triggers as a ground truth for XAI evaluation has been proposed before (e.g., Lin et al., 2021). While BackX is certainly a more comprehensive and well-formalized framework, it feels like an extensive validation and engineering effort built on a known concept, rather than a fundamental conceptual leap. 2. The paper identifies that methods like Grad-CAM an
The paper is clearly written and frames the considerations that should be taken when measuring faithfulness well. There is lots of discussion about the design choices of the benchmark, and the presentation overall is good. The benchmark is tested on different methods, and shows promise in measuring their faithfulness. I find the premise they propose for measuring the faithfulness clever and well thought out.
While I do see the value in this benchmark, I feel that the way it is currently framed in the paper paints it as more powerful than what I understand it to be. If these points can be addressed, I would be open to raising my score. 1. While the authors claim that their benchmark does not require a shift to the model, this is only true if the goal is to explain a model that has a backdoor. For models without this, the benchmark will require that a backdoor is added to them. This is a significant
1. The paper's analysis of the impact of post-processing and output choice is a valuable contribution. 2. The evaluation on image-based tasks is thorough, covering multiple datasets, architectures, and a good variety of trigger types.
1. My primary concern is that the core conceptual contribution of this work is limited. The idea of using model Trojans to create a ground truth for evaluating explanations is not new and builds directly on prior work (e.g., Lin et al., 2021). The paper's contribution feels more like an incremental consolidation and a more thorough implementation of this idea, rather than a fundamentally new paradigm. The "four foundational criteria," while a useful formalization, are largely a consolidation of
1. The research area is highly interesting and possesses significant practical application value. 2. The writing exhibits a logical flow of ideas, and the notation is clearly defined. 3. The research is comprehensive and robust. It proposes four essential criteria that a high-quality benchmark should fulfill and demonstrates that BackX largely meets these criteria. Subsequently, an extensive evaluation of various attribution methods is conducted based on this foundation. 4. The benchmark design
1. The four fidelity criteria introduced in Section 2 are explained at a highly abstract level, making them difficult to grasp. For instance, the example `M = M_X + M_Y` used to illustrate "Attribution Verifiability" is confusing. The subsequent explanation seems to focus on preventing the mask S from introducing extra mutual information with the class, which makes the purpose of the `M_X + M_Y` decomposition unclear. It would be helpful to explain these concepts using concrete examples from exis
Taxonomy
Topics: Brain Tumor Detection and Classification
Methods: Sparse Evolutionary Training
