Probing the Probes: Methods and Metrics for Concept Alignment
Jacob Lysnæs-Larsen, Marte Eggen, Inga Strümke

TL;DR
This paper critically examines how reliable probe accuracy is as a measure of concept alignment in explainable AI, introduces a new spatial attribution method, and proposes metrics for better assessing concept representation in neural networks.
Contribution
It reveals the limitations of probe accuracy as a measure of concept alignment, introduces a novel spatial attribution technique, and proposes new metrics for evaluating concept alignment.
Findings
Probe accuracy can be misleading due to spurious correlations.
Spatial linear attribution improves concept localization.
The proposed alignment metrics assess probes more reliably than standard classification accuracy.
Abstract
In explainable AI, Concept Activation Vectors (CAVs) are typically obtained by training linear classifier probes to detect human-understandable concepts as directions in the activation space of deep neural networks. It is widely assumed that a high probe accuracy indicates a CAV faithfully representing its target concept. However, we show that the probe's classification accuracy alone is an unreliable measure of concept alignment, i.e., the degree to which a CAV captures the intended concept. In fact, we argue that probes are more likely to capture spurious correlations than they are to represent only the intended concept. As part of our analysis, we demonstrate that deliberately misaligned probes, constructed to exploit spurious correlations, achieve an accuracy close to that of standard probes. To address this severe problem, we introduce a novel concept localization method based on…
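The abstract's central claim, that a probe can score well while pointing away from the intended concept, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' setup: the activations are synthetic Gaussians, the concept is planted along axis 0, a spurious feature that agrees with the label 95% of the time is planted along axis 1, and the helpers `train_probe`, `accuracy`, and `cosine` are hypothetical names. The "misaligned" probe is built by hiding the concept axis during training, and cosine similarity to the planted direction stands in for alignment only because the true direction is known in this toy case; it is not one of the metrics the paper proposes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4000, 8  # number of samples and activation dimensions (illustrative)

# Binary concept label and a spurious feature that agrees with it 95% of the time.
y = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.05
s = np.where(flip, 1 - y, y)

# Synthetic "activations": axis 0 encodes the concept, axis 1 the spurious feature.
X = rng.normal(size=(n, d))
X[:, 0] += 1.5 * (2 * y - 1)
X[:, 1] += 1.5 * (2 * s - 1)

def train_probe(X, y, lr=0.5, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w + b, -30, 30)))
        g = p - y  # gradient of the log-loss w.r.t. the logits
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def accuracy(w, b, X, y):
    return float((((X @ w + b) > 0).astype(int) == y).mean())

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hold out the last 1000 samples for evaluation.
X_tr, y_tr, X_te, y_te = X[:3000], y[:3000], X[3000:], y[3000:]

# Standard probe: sees all activation dimensions.
w_std, b_std = train_probe(X_tr, y_tr)

# Deliberately misaligned probe: the concept axis is zeroed out during training,
# so the probe can only rely on the spurious feature.
X_masked = X_tr.copy()
X_masked[:, 0] = 0.0
w_mis, b_mis = train_probe(X_masked, y_tr)

concept_dir = np.zeros(d)
concept_dir[0] = 1.0  # the planted concept direction (known only in this toy setup)

acc_std, acc_mis = accuracy(w_std, b_std, X_te, y_te), accuracy(w_mis, b_mis, X_te, y_te)
cos_std, cos_mis = cosine(w_std, concept_dir), cosine(w_mis, concept_dir)
print(f"standard probe:   accuracy={acc_std:.3f}, cosine to concept={cos_std:.3f}")
print(f"misaligned probe: accuracy={acc_mis:.3f}, cosine to concept={cos_mis:.3f}")
```

In this sketch both probes classify the held-out samples well, yet the misaligned probe's weight vector has no component along the planted concept direction, so accuracy alone cannot tell the two apart.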
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
