Prediction is not Explanation: Revisiting the Explanatory Capacity of Mapping Embeddings
Hanna Herasimchyk, Alhassan Abdelhalim, Sören Laue, Michaela Regneri

TL;DR
This paper critically examines the common practice of interpreting word embeddings through feature prediction, revealing that such methods often reflect geometric similarity rather than genuine semantic understanding.
Contribution
It demonstrates that successful feature prediction does not reliably indicate semantic knowledge in embeddings and that prediction results are influenced by the geometric properties of the embedding space, challenging prior interpretability assumptions.
Findings
Prediction accuracy does not imply semantic interpretability.
Methods can predict random information, indicating superficial correlations.
Geometric similarity dominates the interpretability signals.
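To make the random-information finding concrete, the following is a minimal sketch of how a random-feature control for a mapping probe could be set up. It is not the authors' implementation: the ridge-regression probe, the held-out R^2 score, the row-shuffling control, and all names (`ridge_fit`, `probe_r2`, `random_control`) are assumptions chosen for illustration, and the toy data is synthetic. Because the toy embeddings are i.i.d. Gaussian, the sketch shows only how the comparison is structured; it does not reproduce the paper's results.

```python
import numpy as np


def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def probe_r2(X, Y, train_frac=0.8, alpha=1.0, seed=0):
    """Fit a linear map on a train split and return the held-out R^2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    tr, te = idx[:n_train], idx[n_train:]
    W = ridge_fit(X[tr], Y[tr], alpha)
    resid = Y[te] - X[te] @ W
    # Baseline: always predict the mean feature vector of the train split.
    total = Y[te] - Y[tr].mean(axis=0)
    return 1.0 - (resid ** 2).sum() / (total ** 2).sum()


def random_control(Y, seed=0):
    """Shuffle feature rows across words, breaking any word-feature link."""
    rng = np.random.default_rng(seed)
    return Y[rng.permutation(len(Y))]


# Hypothetical toy data: 500 "words", 20-dim embeddings, 5 features
# that are an exact linear function of the embeddings.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))
Y = X @ rng.normal(size=(20, 5))

real_score = probe_r2(X, Y, alpha=1e-6)          # probe on the true features
control_score = probe_r2(X, random_control(Y))   # same probe on the control
```

The point of such a control is the comparison itself: a probe score on real feature norms is informative only relative to the score the same probe reaches on features that carry no information about the words.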
Abstract
Understanding what knowledge is implicitly encoded in deep learning models is essential for improving the interpretability of AI systems. This paper examines common methods to explain the knowledge encoded in word embeddings, which are core elements of large language models (LLMs). These methods typically involve mapping embeddings onto collections of human-interpretable semantic features, known as feature norms. Prior work assumes that accurately predicting these semantic features from the word embeddings implies that the embeddings contain the corresponding knowledge. We challenge this assumption by demonstrating that prediction accuracy alone does not reliably indicate genuine feature-based interpretability. We show that these methods can successfully predict even random information, concluding that the results are predominantly determined by an algorithmic upper bound rather than…
