Preserving Task-Relevant Information Under Linear Concept Removal
Floris Holstege, Shauli Ravfogel, Bram Wouters

TL;DR
This paper introduces SPLINCE, a novel method for removing unwanted concepts from neural network representations while preserving their covariance with target labels, improving fairness and interpretability.
Contribution
SPLINCE is the first technique to exactly preserve label covariance while removing linear concept predictability. It is backed by a uniqueness guarantee and outperforms baselines on Bias in Bios and Winobias.
Findings
Outperforms baselines on Bias in Bios and Winobias benchmarks
Effectively removes protected attributes with minimal main-task information loss
Is provably the unique solution that removes linear concept predictability and preserves target covariance with minimal embedding distortion
Abstract
Modern neural networks often encode unwanted concepts alongside task-relevant information, leading to fairness and interpretability concerns. Existing post-hoc approaches can remove undesired concepts but often degrade useful signals. We introduce SPLINCE (Simultaneous Projection for LINear concept removal and Covariance prEservation), which eliminates sensitive concepts from representations while exactly preserving their covariance with a target label. SPLINCE achieves this via an oblique projection that 'splices out' the unwanted direction yet protects important label correlations. Theoretically, it is the unique solution that removes linear concept predictability and maintains target covariance with minimal embedding distortion. Empirically, SPLINCE outperforms baselines on benchmarks such as Bias in Bios and Winobias, removing protected attributes while minimally damaging main-task…
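The two constraints named in the abstract can be illustrated with a small NumPy sketch. This is not the paper's SPLINCE construction: it omits the minimal-distortion criterion that makes SPLINCE unique, and it is only one of many oblique projections satisfying both constraints. The function names (`cross_cov`, `oblique_projection`), the synthetic data, and the choice of a scalar concept `z` and scalar label `y` are all assumptions made for illustration. The idea is to build a projection P whose kernel contains Cov(x, z), so that Cov(Px, z) = 0, and whose range contains Cov(x, y), so that Cov(Px, y) = Cov(x, y).

```python
import numpy as np

def cross_cov(X, t):
    """Empirical cross-covariance between rows of X (n, d) and a scalar variable t (n,)."""
    Xc = X - X.mean(axis=0)
    tc = t - t.mean()
    return Xc.T @ tc / len(t)

def oblique_projection(X, z, y):
    """Illustrative oblique projection (hypothetical helper, not the paper's formula).

    Returns P with P @ u = 0 and P @ v = v, where
    u = Cov(x, z) is the direction to remove and
    v = Cov(x, y) is the direction to leave untouched.
    """
    u = cross_cov(X, z)
    v = cross_cov(X, y)
    # w is the part of u orthogonal to v, so that w @ v == 0.
    w = u - (u @ v) / (v @ v) * v
    denom = w @ u
    if denom <= 1e-12 * (u @ u):
        raise ValueError("Cov(x, z) and Cov(x, y) are (nearly) parallel; no such projection.")
    # P = I - u w^T / (w^T u): kernel is span(u), and v is a fixed point.
    return np.eye(X.shape[1]) - np.outer(u, w) / denom

# Synthetic data (assumption): concept z and label y are correlated,
# and each leaks into a different embedding coordinate.
rng = np.random.default_rng(0)
n, d = 2000, 8
z = rng.integers(0, 2, size=n).astype(float)
y = ((z + rng.normal(size=n)) > 0.5).astype(float)
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * z
X[:, 1] += 2.0 * y

P = oblique_projection(X, z, y)
X_clean = X @ P.T  # apply the projection to every embedding

print("max |Cov(Px, z)|:", np.abs(cross_cov(X_clean, z)).max())
print("max |Cov(Px, y) - Cov(x, y)|:",
      np.abs(cross_cov(X_clean, y) - cross_cov(X, y)).max())
```

Because the empirical cross-covariance is linear in the embeddings, both printed quantities should be zero up to floating-point error. An orthogonal projection that removes Cov(x, z) would generally not leave Cov(x, y) unchanged, which is why the sketch uses an oblique one.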
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
