Model extraction from counterfactual explanations
Ulrich Aïvodji, Alexandre Bolot, Sébastien Gambs

TL;DR
This paper shows how counterfactual explanations, used for interpreting black-box models, can be exploited by adversaries to perform high-fidelity model extraction attacks, raising privacy concerns.
Contribution
It introduces a novel attack leveraging counterfactual explanations to accurately replicate black-box models, highlighting privacy vulnerabilities.
Findings
High-fidelity model extraction achievable with limited queries
Counterfactual explanations leak significant model information
Attack effective on real-world datasets
Abstract
Post-hoc explanation techniques refer to a posteriori methods that can be used to explain how black-box machine learning models produce their outcomes. Among post-hoc explanation techniques, counterfactual explanations are becoming one of the most popular methods to achieve this objective. In particular, in addition to highlighting the most important features used by the black-box model, they provide users with actionable explanations in the form of data instances that would have received a different outcome. Nonetheless, by doing so, they also leak non-trivial information about the model itself, which raises privacy issues. In this work, we demonstrate how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks. More precisely, our attack enables the adversary to build a faithful copy of a…
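The attack idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 2-D linear black box, the toy projection-based explainer, the logistic-regression surrogate, and every name and parameter below (`black_box`, `counterfactual`, `extract`, `fidelity`, `eps`, the query budget) are assumptions chosen for illustration. The attacker only sees labels and counterfactuals, adds each counterfactual to the training set with the opposite label, and fits a surrogate on the result.

```python
import math
import random

# Hypothetical black box: a 2-D linear classifier hidden from the attacker.
_W = (1.0, -2.0)
_B = 0.5
_NORM = math.hypot(*_W)


def black_box(x):
    """Return the secret model's label (0 or 1) for point x."""
    return 1 if _W[0] * x[0] + _W[1] * x[1] + _B > 0 else 0


def counterfactual(x, eps=0.05):
    """Toy explainer (assumption): closest point lying eps past the boundary.

    It runs on the provider's side, so it may use the model's parameters.
    """
    s = (_W[0] * x[0] + _W[1] * x[1] + _B) / _NORM  # signed distance
    step = s + (eps if s > 0 else -eps)
    return (x[0] - step * _W[0] / _NORM, x[1] - step * _W[1] / _NORM)


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def extract(n_queries=20, epochs=2000, lr=0.1, seed=0):
    """Attacker side: query, collect counterfactuals, fit a surrogate."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_queries):
        x = (rng.uniform(-3, 3), rng.uniform(-3, 3))
        y = black_box(x)
        data.append((x, y))
        # A counterfactual receives the other outcome by definition.
        data.append((counterfactual(x), 1 - y))

    # Plain logistic regression trained with full-batch gradient descent.
    w0 = w1 = b = 0.0
    n = len(data)
    for _ in range(epochs):
        g0 = g1 = gb = 0.0
        for x, y in data:
            err = _sigmoid(w0 * x[0] + w1 * x[1] + b) - y
            g0 += err * x[0]
            g1 += err * x[1]
            gb += err
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
        b -= lr * gb / n
    return lambda x: 1 if w0 * x[0] + w1 * x[1] + b > 0 else 0


def fidelity(surrogate, n_test=2000, seed=1):
    """Fraction of fresh points on which surrogate and black box agree."""
    rng = random.Random(seed)
    pts = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(n_test)]
    return sum(surrogate(p) == black_box(p) for p in pts) / n_test


if __name__ == "__main__":
    print("fidelity:", fidelity(extract()))
```

Each counterfactual sits just across the decision boundary from its query, so a small query budget yields training points concentrated where the boundary lies. That is the intuition behind the leakage the abstract refers to; real explainers and models are more complex than this toy setting.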
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
