Bridging Explainability and Embeddings: BEE Aware of Spuriousness
Cristian Daniel Păduraru, Antonio Bărbălau, Radu Filipescu, Andrei Liviu Nicolicioiu, Elena Burceanu

TL;DR
BEE is a framework that analyzes embedding geometry and weight-space perturbations to uncover hidden spurious correlations in models, improving dataset auditing and model trustworthiness.
Contribution
BEE introduces a novel diagnostic approach using linear probing on embedding space to detect persistent and transferable spurious correlations across models and domains.
Findings
Uncovers hidden spurious correlations in vision and language models.
Reveals concepts that significantly reduce ImageNet accuracy.
Detects clinical shortcuts causing false negatives in medical notes.
Abstract
Current methods for detecting spurious correlations rely on analyzing dataset statistics or error patterns, leaving many harmful shortcuts invisible when counterexamples are absent. We introduce BEE (Bridging Explainability and Embeddings), a framework that shifts the focus from model predictions to the weight space, and to the embedding geometry underlying decisions. By analyzing how fine-tuning perturbs pretrained representations, BEE uncovers spurious correlations that remain hidden from conventional evaluation pipelines. We use linear probing as a transparent diagnostic lens, revealing spurious features that not only persist after full fine-tuning but also transfer across diverse state-of-the-art models. Our experiments cover numerous datasets and domains: vision (Waterbirds, CelebA, ImageNet-1k), language (CivilComments, MIMIC-CXR medical notes), and multiple embedding families…
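The abstract's "linear probing as a transparent diagnostic lens" can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the concept names are invented, and the function names (`make_toy_data`, `train_linear_probe`, `rank_concepts`) are hypothetical. It assumes that concept embeddings and probe weights live in the same space, so a probe weight vector can be compared to each concept by cosine similarity. In a real setting the embeddings would come from a model such as CLIP rather than from a random generator.

```python
import numpy as np


def make_toy_data(n=2000, d=32, seed=0):
    """Synthetic stand-in for frozen embeddings (all values are made up)."""
    rng = np.random.default_rng(seed)
    names = ["bird shape", "water background", "grass", "sky", "car"]
    # Orthonormal concept directions; a real setup would use a text encoder.
    q, _ = np.linalg.qr(rng.normal(size=(d, len(names))))
    concepts = q.T  # shape (n_concepts, d)
    y = rng.integers(0, 2, size=n).astype(float)
    sign = 2.0 * y - 1.0
    # Assumption: the spurious feature agrees with the label 95% of the time.
    agree = rng.random(n) < 0.95
    spurious = np.where(agree, sign, -sign) * 2.0
    X = (sign[:, None] * concepts[0]          # core feature
         + spurious[:, None] * concepts[1]    # spurious feature
         + rng.normal(size=(n, d)))           # isotropic noise
    return X, y, concepts, names


def train_linear_probe(X, y, lr=0.5, steps=500):
    """Fit a binary logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def rank_concepts(w, concept_embs, names):
    """Sort concepts by cosine similarity to the probe weight vector."""
    w_unit = w / np.linalg.norm(w)
    c_unit = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    sims = c_unit @ w_unit
    order = np.argsort(-sims)
    return [(names[i], float(sims[i])) for i in order]


if __name__ == "__main__":
    X, y, concepts, names = make_toy_data()
    w, b = train_linear_probe(X, y)
    for name, sim in rank_concepts(w, concepts, names):
        print(f"{name:>18s}  {sim:+.3f}")
```

In this toy setup the probe weight aligns with both the core concept and the label-correlated background concept, while unrelated concepts score near zero. A high-ranking concept that is not part of the class definition is the kind of candidate shortcut an audit would then inspect by hand.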
Peer Reviews
Decision: ICLR 2026 Poster
1. The authors showcase a wide range of use cases, including zero-shot classification, discovering spurious correlations within datasets, and applications to both text-based and image-based datasets.
1. The proposed method closely resembles the concepts of Label-free CBM [1] and Post-hoc CBM [2], both of which also utilize CLIP-based image/text encoders. In particular, Post-hoc CBM (see Table 10 in its Appendix) demonstrates a similar approach to identify biased concepts residing in the dataset. While the authors could emphasize the spurious concept discovery component as their main contribution, the process of using LLMs and captioning models to enumerate and filter potential concepts appe
1. Core idea is clear: both the linear head and concepts are in the same embedding space. Direct comparison of learned weights and concept embeddings constitutes a direct probe into the decision mechanisms of the classifier. 2. Efficient as it eliminates the need for expensive backbone retraining or the construction of elaborate counterexample data splits. 3. Empirical validation across multiple modalities / datasets and different foundational embedding families (e.g., CLIP and BLIP-2). 4. Image
1. Missing implementation details for reproducibility about the construction of the concept pool (prompts used with Llama-3.1-8B-Instruct, the exact filtering and de-duplication rules applied with WordNet and the final vocabulary size for each dataset). 2. BEE has a variable number of concept prompts per class. No details on whether the baselines like B2T / SpLiCE were given an equivalent prompt budget - difficult to ascertain if the observed gains are because of the higher quality of the BEE se
- a conceptually elegant and model-agnostic approach to uncover spurious correlations from classifier weights, which is both simple and broadly applicable. - tightly aligned with foundation models: BEE leverages their shared embedding spaces to analyze both classifier weights and textual concepts. This makes BEE natively compatible with large pre-trained models, including CLIP, BLIP2, mGTE, and others. - strong empirical results on diverse tasks and datasets. - the method provides explicit
- BEE operates entirely within the embedding space of large foundation models such as CLIP or mGTE. If the embedding model itself has already encoded biased or spurious associations, BEE may merely surface these existing biases, rather than revealing new or independent shortcuts learned during downstream training. This raises the question of whether BEE is diagnosing the fine-tuning process or simply interpreting the biases already present in the frozen embeddings. - BEE relies on linear probin
Taxonomy
Topics: Statistics Education and Methodologies · Qualitative Comparative Analysis Research
Methods: Contrastive Language-Image Pre-training
