Probing Classifiers are Unreliable for Concept Removal and Detection
Abhinav Kumar, Chenhao Tan, Amit Sharma

TL;DR
This paper critically examines how reliably probing classifiers can be used to remove and detect concepts in neural network representations, and shows that probing-based methods can fail to remove a concept or can harm task performance.
Contribution
The paper provides a theoretical analysis showing that probing classifiers can be unreliable for concept removal, and supports it with empirical experiments across multiple datasets.
Findings
Probing classifiers often use non-concept features, leading to ineffective concept removal.
Post-hoc and adversarial methods can destroy task-relevant information.
A new spuriousness metric is proposed to evaluate classifier quality.
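The first finding can be illustrated with a small sketch. This is a toy setup and not the paper's construction: the two-dimensional representation, the 90% agreement rate between the features, and the names `data`, `train_probe`, and `concept_only_acc` are all assumptions made for illustration. The concept feature alone separates the concept labels perfectly, yet a logistic-regression probe trained by gradient descent still places positive weight on the correlated non-concept feature.

```python
import math

# Hypothetical toy data: z = (concept_feature, task_feature), y = concept label.
# The task feature agrees with the concept label on 90% of the points.
data = (
    [((1.0, 1.0), 1)] * 45
    + [((1.0, -1.0), 1)] * 5
    + [((-1.0, -1.0), -1)] * 45
    + [((-1.0, 1.0), -1)] * 5
)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_probe(data, steps=500, lr=0.5):
    """Full-batch gradient descent on the logistic loss (no bias term)."""
    w = [0.0, 0.0]
    n = len(data)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for z, y in data:
            margin = y * (w[0] * z[0] + w[1] * z[1])
            coeff = -y * sigmoid(-margin)  # d(loss)/d(w . z)
            grad[0] += coeff * z[0] / n
            grad[1] += coeff * z[1] / n
        w[0] -= lr * grad[0]
        w[1] -= lr * grad[1]
    return w


w = train_probe(data)

# Accuracy of a classifier that looks only at the concept feature.
concept_only_acc = sum((1 if z[0] > 0 else -1) == y for z, y in data) / len(data)

# Accuracy of the learned probe, which uses both coordinates.
probe_acc = sum(
    (1 if w[0] * z[0] + w[1] * z[1] > 0 else -1) == y for z, y in data
) / len(data)

print("probe weights (concept, task):", w)
print("concept-only accuracy:", concept_only_acc)
print("probe accuracy:", probe_acc)
```

In this sketch both classifiers reach perfect accuracy, so accuracy alone does not reveal that the probe's direction mixes in the task feature.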
Abstract
Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representation. Removing such concepts is non-trivial because of a complex relationship between the concept, text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the concepts entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the concept. Even under the most favorable conditions for learning a probing classifier when a concept's relevant features in representation space alone can provide 100% accuracy, we prove that a…
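The failure mode described in the abstract can be sketched on a toy example. Everything below is an assumption made for illustration: the two-dimensional representation, the fixed probe direction `w`, and the use of a projection that removes the component along the probe direction as a stand-in for a post-hoc removal method. It is not claimed to be the exact procedure the paper analyses. The sketch shows one projection step leaving the concept 90% predictable, and a second step along the remaining direction erasing the representation, task feature included.

```python
# Hypothetical toy data: (z, concept_label, task_label) with
# z = (concept_feature, task_feature); the labels agree on 90% of points.
points = (
    [((1.0, 1.0), 1, 1)] * 45
    + [((1.0, -1.0), 1, -1)] * 5
    + [((-1.0, -1.0), -1, -1)] * 45
    + [((-1.0, 1.0), -1, 1)] * 5
)

# Assumed probe direction that mixes the concept and task features.
w = (2.0, 1.0)
# Direction orthogonal to w: the only direction left after one projection.
u = (-w[1], w[0])


def project_out(z, d):
    """Remove the component of z along direction d."""
    scale = (z[0] * d[0] + z[1] * d[1]) / (d[0] ** 2 + d[1] ** 2)
    return (z[0] - scale * d[0], z[1] - scale * d[1])


def sign_accuracy(coords, labels):
    """Best accuracy of predicting labels from the sign of a 1-D coordinate."""
    hits = sum((1 if t > 0 else -1) == y for t, y in zip(coords, labels))
    acc = hits / len(labels)
    return max(acc, 1.0 - acc)


# Step 1: project out the probe direction and read off the 1-D coordinate.
step1 = [project_out(z, w) for z, _, _ in points]
coords = [z[0] * u[0] + z[1] * u[1] for z in step1]
concept_acc = sign_accuracy(coords, [c for _, c, _ in points])
task_acc = sign_accuracy(coords, [t for _, _, t in points])

# Step 2: a probe on the residual can only point along u; removing that
# direction as well leaves nothing behind.
step2 = [project_out(z, u) for z in step1]
max_residual = max(max(abs(z[0]), abs(z[1])) for z in step2)

print("concept accuracy after one projection:", concept_acc)
print("task accuracy after one projection:", task_acc)
print("largest remaining coordinate after two projections:", max_residual)
```

The concept remains predictable after the first step only through its correlation with the task feature, which is why further removal in this toy setting comes at the cost of the task information itself.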
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Domain Adaptation and Few-Shot Learning
