How Explanations Leak the Decision Logic: Stealing Graph Neural Networks via Explanation Alignment

Bin Ma; Yuyuan Feng; Minhua Lin; Enyan Dai

arXiv:2506.03087·cs.LG·June 4, 2025

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper reveals security risks in explainable GNNs by demonstrating how explanations can be exploited to steal models, proposing a novel framework that effectively replicates target models' behavior and reasoning patterns.

Contribution

It introduces a model-stealing framework for GNNs that combines explanation alignment with guided data augmentation, using the target model's explanation mechanism to steal it efficiently.
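To make the "guided data augmentation" idea concrete, here is a minimal sketch in plain Python. It follows the description given in one of the reviews on this page: the explanation subgraph is held fixed while the rest of the graph is perturbed, and the augmented graph keeps the original label. The function name `intervene`, its parameters, and the specific perturbation (random edge drops plus a few random edge additions) are assumptions made for illustration; the page does not specify how the authors perturb graphs.

```python
import random

def intervene(edges, explanation_edges, num_nodes,
              drop_prob=0.3, add_count=1, seed=None):
    """Hypothetical helper: perturb a graph while keeping its explanation subgraph.

    edges and explanation_edges are iterables of undirected (u, v) pairs.
    Explanation edges are always kept; other edges are dropped with
    probability drop_prob; up to add_count new edges are added between
    nodes that do not belong to the explanation subgraph.
    """
    rng = random.Random(seed)
    all_edges = {tuple(sorted(e)) for e in edges}
    keep = {tuple(sorted(e)) for e in explanation_edges} & all_edges

    # Start from the explanation subgraph, then randomly keep other edges.
    result = set(keep)
    for e in sorted(all_edges - keep):
        if rng.random() >= drop_prob:
            result.add(e)

    # Only add edges between nodes outside the explanation subgraph,
    # so the explanation itself is left untouched.
    expl_nodes = {n for e in keep for n in e}
    free = [n for n in range(num_nodes) if n not in expl_nodes]
    candidates = [(u, v) for i, u in enumerate(free) for v in free[i + 1:]
                  if (u, v) not in all_edges]
    rng.shuffle(candidates)
    result.update(candidates[:add_count])
    return sorted(result)

# Toy usage: a 6-node path graph whose first two edges form the explanation.
original = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
explanation = [(0, 1), (1, 2)]
augmented = intervene(original, explanation, num_nodes=6, seed=0)
```

The augmented graph would then be paired with the label returned for the original graph. As the reviews note, that label reuse is only valid if the explanation subgraph really does determine the prediction.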

Findings

01

Our approach outperforms traditional methods in model stealing accuracy.

02

Explanation-based attacks pose significant security threats to GNNs.

03

The work emphasizes the need for protective measures in deploying explainable GNNs.

Abstract

Graph Neural Networks (GNNs) have become essential tools for analyzing graph-structured data in domains such as drug discovery and financial analysis, leading to growing demands for model transparency. Recent advances in explainable GNNs have addressed this need by revealing important subgraphs that influence predictions, but these explanation mechanisms may inadvertently expose models to security risks. This paper investigates how such explanations potentially leak critical decision logic that can be exploited for model stealing. We propose EGSteal, a novel stealing framework that integrates explanation alignment for capturing decision logic with guided data augmentation for efficient training under limited queries, enabling effective replication of both the predictive behavior and underlying reasoning patterns of target models. Experiments on molecular graph datasets demonstrate…
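As a rough illustration of what "explanation alignment" could look like as a training objective, the sketch below adds an explanation-matching term to the usual prediction-matching loss for a surrogate model. Everything here is an assumption made for illustration: the page does not state the authors' alignment objective, so cosine distance between importance vectors is used as a stand-in, and the names `stealing_loss` and `weight` are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the label the target model returned."""
    return -math.log(max(probs[label], 1e-12))

def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def stealing_loss(surrogate_probs, target_label,
                  surrogate_importance, target_importance, weight=1.0):
    """Hypothetical combined objective for one queried graph.

    surrogate_probs:      surrogate's class probabilities
    target_label:         label returned by the target model
    surrogate_importance: surrogate's per-node (or per-edge) importance scores
    target_importance:    importance scores from the target's explanation
    weight:               assumed trade-off between the two terms
    """
    prediction_term = cross_entropy(surrogate_probs, target_label)
    alignment_term = cosine_distance(surrogate_importance, target_importance)
    return prediction_term + weight * alignment_term

# Toy usage: same predictions, explanations that agree vs. disagree.
aligned = stealing_loss([0.5, 0.5], 0, [1.0, 0.0], [1.0, 0.0])
misaligned = stealing_loss([0.5, 0.5], 0, [1.0, 0.0], [0.0, 1.0])
```

In this sketch, a surrogate that matches the target's labels but attends to different parts of the graph is still penalized, which is the intuition behind replicating "underlying reasoning patterns" and not only predictions.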

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The paper is clearly written and well structured, making it easy to follow the authors' ideas and understand the main contributions.
- The figures are well designed and intuitive, effectively illustrating how the proposed method works and helping readers grasp the key concepts at a glance.

Weaknesses

- The paper claims that GNN models are deployed online to protect intellectual property in critical applications (e.g., drug screening). However, in realistic IP-protection scenarios, online ML services are unlikely to provide model explanations, since explanations can reveal internal model logic and sensitive knowledge.
- Only explanations that rely on internal model access (e.g., gradient- or attention-based, or self-interpretable models) provided by an online ML service are useful; for post-

Reviewer 02 · Rating 4 · Confidence 4

Strengths

**1.** The idea of aligning not only the model's classification but also its explanation outcome is interesting. Requiring the surrogate model to share the target model's importance preferences may bring their inner mechanisms closer together and yield a better replication.

**2.** The presentation of the paper is clean and sound. The notation is used clearly and the formulations are expressed tidily. The illustrative figures are easy to understand.

Weaknesses

**1.** The intervened graphs used as augmented data may introduce "incorrectness". In the causal analysis, the authors assume that the explanation subgraph $G_{E}$ is the part that determines the classification, while the style graph has little impact. They therefore arbitrarily perturb the style graph while holding the explanation subgraph fixed to generate the intervened graph, which is assigned the original label and the same explanation subgraph. However, this may be incorrect in some cases. For exa

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper is well organized.
2. Many statements are supported with experiments.
3. The authors tackle a novel problem.

Weaknesses

The authors assume that the provided explanation subgraph fully determines the GNN prediction. However, most post-hoc explanations may not reflect the actual decision logic [1]; moreover, even self-explainable GNNs may provide explanations that differ from the subgraphs that actually drove the predictions [2][3]. This issue, known as unfaithful explanations, is widely recognized. Let me make this comment more actionable: (1) how could you guarantee (or measure)

Code & Models

Repositories

beanmah/egsteal
PyTorch · Official

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Adversarial Robustness in Machine Learning