SHLIME: Foiling adversarial attacks fooling SHAP and LIME
Sam Chauhan, Estelle Duguet, Karthik Ramakrishnan, Hugh Van Deventer, Jack Kruger, Ranjan Subbaraman

TL;DR
This paper examines how adversarially constructed models can conceal their biases from post hoc explanation methods such as LIME and SHAP, and proposes robust ensemble strategies to improve transparency in high-stakes machine learning applications.
Contribution
It introduces a modular testing framework for evaluating and enhancing the robustness of explanation methods against adversarial manipulation.
Findings
Certain ensemble configurations significantly improve bias detection.
The framework enables systematic evaluation of explanation robustness.
Enhanced methods better reveal biases in out-of-distribution models.
Abstract
Post hoc explanation methods, such as LIME and SHAP, provide interpretable insights into black-box classifiers and are increasingly used to assess model biases and generalizability. However, these methods are vulnerable to adversarial manipulation, potentially concealing harmful biases. Building on the work of Slack et al. (2020), we investigate the susceptibility of LIME and SHAP to biased models and evaluate strategies for improving robustness. We first replicate the original COMPAS experiment to validate prior findings and establish a baseline. We then introduce a modular testing framework enabling systematic evaluation of augmented and ensemble explanation approaches across classifiers of varying performance. Using this framework, we assess multiple LIME/SHAP ensemble configurations on out-of-distribution models, comparing their resistance to bias concealment against the original…
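The bias-concealment attack that the abstract builds on (Slack et al., 2020) can be illustrated with a toy sketch: a wrapper routes inputs that look like real data to a biased model, and routes inputs that look like explainer perturbations to an innocuous one, so a perturbation-based explainer mostly observes the innocuous behavior. Everything below is an illustrative assumption, not the authors' code: the function names, the two-column feature layout, and especially the hand-written `looks_in_distribution` rule, which stands in for the trained out-of-distribution detector used in the original attack.

```python
import numpy as np

def biased_model(X):
    # Hypothetical biased classifier: decides only on the sensitive feature (column 0).
    return (X[:, 0] > 0.5).astype(int)

def innocuous_model(X):
    # Hypothetical harmless-looking classifier: decides only on an unrelated feature (column 1).
    return (X[:, 1] > 0.5).astype(int)

def looks_in_distribution(X):
    # Toy stand-in for a learned OOD detector (assumption): real rows are
    # binary-valued here, while perturbed rows fall off the 0/1 grid.
    return np.all(np.isclose(X, np.round(X)), axis=1)

def adversarial_predict(X):
    # Route each row: biased model on real-looking rows, innocuous model otherwise.
    X = np.asarray(X, dtype=float)
    real = looks_in_distribution(X)
    return np.where(real, biased_model(X), innocuous_model(X))

real_rows = np.array([[1.0, 0.0], [0.0, 1.0]])       # resemble actual data
perturbed_rows = np.array([[0.9, 0.1], [0.2, 0.8]])  # resemble explainer samples

print(adversarial_predict(real_rows))
print(adversarial_predict(perturbed_rows))
```

On the real-looking rows the wrapper's output tracks column 0 (the sensitive feature), while on the perturbed rows it tracks column 1, which is the mismatch that lets a biased model appear unbiased to an explainer that queries it on perturbed samples.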
