Probing Classifiers are Unreliable for Concept Removal and Detection

Abhinav Kumar; Chenhao Tan; Amit Sharma

arXiv:2207.04153 · cs.LG · June 21, 2023 · 5 citations

TL;DR

This paper examines how reliably probing classifiers can be used to remove and detect concepts in neural network representations. It shows that removal methods built on them can fail to remove the concept and can harm task performance.

Contribution

It provides a theoretical analysis showing probing classifiers can be unreliable for concept removal, supported by empirical experiments across multiple datasets.

Findings

01. Probing classifiers often use non-concept features, leading to ineffective concept removal.

02. Post-hoc and adversarial methods can destroy task-relevant information.

03. A new spuriousness metric is proposed to evaluate classifier quality.
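The first finding can be illustrated with a minimal sketch. It is a toy construction, not the paper's experiments or its theoretical setup: the two-dimensional "representation", the 90% agreement between concept and task labels, and the helper names `make_toy_representation` and `train_linear_probe` are all assumptions made for illustration. The concept feature alone separates the concept perfectly, yet a linear probe trained by gradient descent still places positive weight on the correlated task feature.

```python
import numpy as np

def make_toy_representation():
    """Noise-free 2-D representation z = [concept_feature, task_feature].

    Hypothetical setup: the concept label c and the task label y agree
    on 90 of 100 examples, so the task feature is correlated with c.
    """
    c = np.array([+1] * 45 + [-1] * 45 + [+1] * 5 + [-1] * 5, dtype=float)
    y = np.array([+1] * 45 + [-1] * 45 + [-1] * 5 + [+1] * 5, dtype=float)
    Z = np.stack([c, y], axis=1)  # column 0: concept, column 1: task
    return Z, c, y

def train_linear_probe(Z, labels, lr=0.5, steps=2000):
    """Logistic-regression probe (no bias), full-batch gradient descent."""
    w = np.zeros(Z.shape[1])
    for _ in range(steps):
        margins = labels * (Z @ w)
        # Gradient of mean(log(1 + exp(-margin))) with respect to w.
        grad = -(Z * (labels / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

Z, c, y = make_toy_representation()
w = train_linear_probe(Z, c)  # probe for the concept, not the task

# The concept feature alone already predicts the concept perfectly.
concept_only_acc = np.mean(np.sign(Z[:, 0]) == c)
probe_acc = np.mean(np.sign(Z @ w) == c)

print("accuracy using only the concept feature:", concept_only_acc)
print("probe accuracy:", probe_acc)
print("probe weights [concept, task]:", w)
```

In this sketch the second weight is clearly positive even though the probe never needed the task feature, so the probe direction is not a clean proxy for the concept.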

Abstract

Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representation. Removing such concepts is non-trivial because of a complex relationship between the concept, text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the concepts entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the concept. Even under the most favorable conditions for learning a probing classifier when a concept's relevant features in representation space alone can provide 100% accuracy, we prove that a…
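To make the failure mode concrete, here is a hedged sketch of one kind of post-hoc removal: projecting the representation onto the null space of a linear probe's weight vector. It is a toy example, not the paper's method or proof. The function name `nullspace_projection`, the four-point dataset, and both probe directions are assumptions chosen for illustration; the "mixed" direction stands in for a probe that leans equally on the concept feature and a correlated task feature.

```python
import numpy as np

def nullspace_projection(w):
    """Return the matrix that projects onto the null space of direction w."""
    w = np.asarray(w, dtype=float)
    return np.eye(w.size) - np.outer(w, w) / (w @ w)

# Toy representation: column 0 is the concept feature, column 1 the task feature.
# Rows 0-1: concept and task agree. Rows 2-3: they disagree.
Z = np.array([[+1.0, +1.0],
              [-1.0, -1.0],
              [+1.0, -1.0],
              [-1.0, +1.0]])

# Hypothetical ideal probe: uses only the concept feature.
ideal_probe = np.array([1.0, 0.0])
# Hypothetical mixed probe: also relies on the correlated task feature.
mixed_probe = np.array([1.0, 1.0])

Z_ideal = Z @ nullspace_projection(ideal_probe)
Z_mixed = Z @ nullspace_projection(mixed_probe)

print("after removing the ideal direction:\n", Z_ideal)
print("after removing the mixed direction:\n", Z_mixed)
```

With the ideal direction, the concept feature is zeroed and the task feature is untouched. With the mixed direction, the rows where concept and task agree collapse to zero, erasing their task feature, while the rows where they disagree pass through unchanged with the concept feature intact.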

Videos

Probing Classifiers are Unreliable for Concept Removal and Detection · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Domain Adaptation and Few-Shot Learning