TL;DR
EnsembleSHAP introduces a computationally efficient, robust, and secure feature attribution method specifically designed for the random subspace method, providing provable protection against various explanation-preserving attacks.
Contribution
The paper presents EnsembleSHAP as the first provably robust and faithful feature attribution method tailored to the random subspace method. It reuses the computational byproducts that the random subspace method already produces.
Findings
EnsembleSHAP is computationally efficient and maintains local accuracy.
It offers guaranteed protection against privacy-preserving attacks.
Evaluations show effectiveness against backdoor, adversarial, and jailbreak attacks.
Abstract
The random subspace method has wide security applications, such as providing certified defenses against adversarial and backdoor attacks and building robustly aligned LLMs against jailbreaking attacks. However, the explanation of the random subspace method has not been sufficiently explored. Existing state-of-the-art feature attribution methods, such as Shapley value and LIME, are computationally impractical and lack security guarantees when applied to the random subspace method. In this work, we propose EnsembleSHAP, an intrinsically faithful and secure feature attribution method for the random subspace method that reuses its computational byproducts. Specifically, our feature attribution method 1) is computationally efficient, 2) maintains essential properties of effective feature attribution (such as local accuracy), and 3) offers guaranteed protection against privacy-preserving attacks on feature attribution…
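To make "reuses its computational byproducts" and "local accuracy" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's definition: the truncated abstract does not give the EnsembleSHAP formula. The attribution rule used here (each vote for the target label is split evenly among the k features in the subset that cast it) is an assumed stand-in chosen because it satisfies local accuracy. The names `random_subspace_predict`, `attribute_from_votes`, and `toy_model` are hypothetical.

```python
import random
from collections import Counter


def random_subspace_predict(x, base_model, k, n_samples, seed=0):
    """Majority vote of a base model over random k-feature subsets of x.

    Returns the winning label, its vote fraction, and the per-subset
    records (subset, label), which are the byproducts reused below.
    """
    rng = random.Random(seed)
    d = len(x)
    records = []
    for _ in range(n_samples):
        subset = tuple(sorted(rng.sample(range(d), k)))
        # The base model only sees the kept features (index -> value).
        label = base_model({i: x[i] for i in subset})
        records.append((subset, label))
    votes = Counter(label for _, label in records)
    label, count = votes.most_common(1)[0]
    return label, count / n_samples, records


def attribute_from_votes(records, target, d, k):
    """ASSUMED rule (not taken from the source): every subset that voted
    for `target` gives 1/k of its vote to each feature it contains.
    No extra model queries are needed beyond those already recorded."""
    n = len(records)
    scores = [0.0] * d
    for subset, label in records:
        if label == target:
            for i in subset:
                scores[i] += 1.0 / (k * n)
    return scores


def toy_model(kept):
    """Hypothetical base classifier: label 1 if the token 'bad' is kept."""
    return 1 if "bad" in kept.values() else 0


x = ["the", "movie", "was", "bad", "overall", "today"]
k = 4
label, vote_fraction, records = random_subspace_predict(
    x, toy_model, k=k, n_samples=200
)
scores = attribute_from_votes(records, target=label, d=len(x), k=k)

# Local accuracy under the assumed rule: the attributions sum to the
# ensemble's vote fraction for the explained label.
print(label, round(vote_fraction, 3), round(sum(scores), 3))
```

Because the attribution is computed from the recorded (subset, label) pairs alone, its cost is linear in the number of sampled subsets, which is the sense of "computationally efficient" that this sketch illustrates.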
