The Road to Explainability is Paved with Bias: Measuring the Fairness of Explanations
Aparna Balagopalan, Haoran Zhang, Kimia Hamidieh, Thomas Hartvigsen, Frank Rudzicz, Marzyeh Ghassemi

TL;DR
This paper evaluates the fairness of post-hoc explanation methods across different subgroups in sensitive domains, revealing significant fidelity gaps and emphasizing the need for transparent communication of explanation fairness issues.
Contribution
It provides the first systematic audit of explanation fairness across multiple domains, models, and methods, highlighting fidelity disparities and proposing ways to improve explanation fairness.
Findings
Fidelity of explanations varies significantly between subgroups.
Pairing explainability with robust ML can improve fairness.
Unfair explanations pose a critical, understudied challenge.
Abstract
Machine learning models in safety-critical settings like healthcare are often blackboxes: they contain a large number of parameters which are not transparent to users. Post-hoc explainability methods where a simple, human-interpretable model imitates the behavior of these blackbox models are often proposed to help users trust model predictions. In this work, we audit the quality of such explanations for different protected subgroups using real data from four settings in finance, healthcare, college admissions, and the US justice system. Across two different blackbox model architectures and four popular explainability methods, we find that the approximation quality of explanation models, also known as the fidelity, differs significantly between subgroups. We also demonstrate that pairing explainability methods with recent advances in robust machine learning can improve explanation…
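To make the notion of a fidelity gap concrete, here is a minimal sketch of how such an audit could be computed. It assumes fidelity is measured as the agreement rate between the blackbox model's predicted labels and the explanation model's predicted labels, and it summarizes the disparity as the difference between the best-served and worst-served subgroup. The function names, the toy arrays, and this particular gap definition are illustrative assumptions, not the paper's exact metrics or code.

```python
import numpy as np


def fidelity(blackbox_pred, explainer_pred):
    """Fraction of points where the explanation model matches the blackbox."""
    blackbox_pred = np.asarray(blackbox_pred)
    explainer_pred = np.asarray(explainer_pred)
    return float(np.mean(blackbox_pred == explainer_pred))


def subgroup_fidelity(blackbox_pred, explainer_pred, groups):
    """Fidelity computed separately for each protected subgroup."""
    blackbox_pred = np.asarray(blackbox_pred)
    explainer_pred = np.asarray(explainer_pred)
    groups = np.asarray(groups)
    result = {}
    for g in np.unique(groups).tolist():
        mask = groups == g
        result[g] = fidelity(blackbox_pred[mask], explainer_pred[mask])
    return result


def max_fidelity_gap(per_group):
    """Assumed gap summary: best subgroup fidelity minus worst subgroup fidelity."""
    values = list(per_group.values())
    return max(values) - min(values)


# Hypothetical toy data: predicted labels from a blackbox, predicted labels
# from its explanation model, and a protected attribute for each example.
blackbox = [1, 0, 1, 1, 0, 0, 1, 0]
explainer = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = subgroup_fidelity(blackbox, explainer, groups)
gap = max_fidelity_gap(per_group)
print(per_group, gap)
```

In this toy example the explanation model reproduces the blackbox perfectly for subgroup "a" but only half the time for subgroup "b", so a single overall fidelity score would hide the fact that one group receives much less faithful explanations.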
