Watermarking Counterfactual Explanations
Hangzhi Guo, Firdaus Ahmed Choudhury, Tinghua Chen, Amulya Yadav

TL;DR
This paper introduces CFMark, a watermarking framework that embeds detectable watermarks into counterfactual explanations to identify unauthorized model extraction attacks without compromising explanation quality.
Contribution
We propose a novel bi-level optimization-based watermarking method for CF explanations that enables secure detection of model theft attacks while maintaining explanation fidelity.
Findings
CFMark achieves an F1 score of ~0.89 in attack detection.
Watermarking causes only ~1.3% degradation in explanation validity.
The framework is effective across diverse datasets and attack techniques.
Abstract
Counterfactual (CF) explanations for ML model predictions provide actionable recourse recommendations to individuals adversely impacted by predicted outcomes. However, despite being preferred by end-users, CF explanations have been shown to pose significant security risks in real-world applications; in particular, malicious adversaries can exploit CF explanations to perform query-efficient model extraction attacks on the underlying proprietary ML model. To address this security challenge, we propose CFMark, a novel model-agnostic watermarking framework for detecting unauthorized model extraction attacks relying on CF explanations. CFMark involves a novel bi-level optimization problem to embed an indistinguishable watermark into the generated CF explanation such that any future model extraction attacks using these watermarked CF explanations can be detected using a null hypothesis…
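The abstract is cut off before it names the test it uses, so the following is only a minimal sketch of how a null-hypothesis test could flag a suspect model; it is not CFMark's actual detection procedure. It assumes the defender can query the suspect model on paired inputs (a watermarked CF and its un-watermarked counterpart) and that an extracted model tends to score the watermarked version higher. The function names, the score lists, and the choice of a paired sign-flip permutation test are all hypothetical and chosen for illustration.

```python
import random
from statistics import mean


def paired_sign_flip_pvalue(scores_marked, scores_plain, n_perm=5000, seed=0):
    """One-sided paired sign-flip permutation test (illustrative, not CFMark's test).

    H0: the suspect model scores watermarked and plain CFs the same on average.
    H1: the suspect model scores watermarked CFs higher.
    """
    if len(scores_marked) != len(scores_plain):
        raise ValueError("paired samples must have the same length")

    # Per-pair difference in the suspect model's output.
    diffs = [m - p for m, p in zip(scores_marked, scores_plain)]
    observed = mean(diffs)

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under H0 the sign of each difference is arbitrary, so flip it at random.
        flipped = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if flipped >= observed:
            hits += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def flag_extraction(scores_marked, scores_plain, alpha=0.05):
    """Return True when H0 is rejected at significance level alpha."""
    return paired_sign_flip_pvalue(scores_marked, scores_plain) < alpha
```

With 20 pairs whose watermarked scores are all clearly higher, the test rejects the null hypothesis; with identical scores on both sides it does not. A permutation test is used here only because it needs no distributional assumptions and no third-party dependencies.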
Taxonomy
Topics: Advanced Steganography and Watermarking Techniques · Digital and Cyber Forensics · Digital Media Forensic Detection
Methods: Sparse Evolutionary Training
