Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval
Hang Gao, Dimitris N. Metaxas

TL;DR
This paper investigates how semantic evolution within texts causes embedding collapse and retrieval issues in transformer models, offering a theoretical framework and a measurable indicator called semantic shift.
Contribution
It introduces the concept of semantic shift as a key factor in embedding pathology, providing a formal measure and empirical validation across models and datasets.
Findings
Semantic shift correlates with embedding concentration and retrieval performance.
Text length alone does not predict embedding degradation.
Semantic shift can diagnose when anisotropy harms retrieval.
Abstract
Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this…
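The smoothing effect described in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's semantic shift measure or its experimental setup; it assumes plain mean pooling over synthetic unit-norm "sentence" vectors, and the helper names (`sample_sentence_embeddings`, `mean_cosine_to_pooled`) and the `spread` parameter are hypothetical. For unit-norm vectors, the average cosine similarity between the sentences and their mean-pooled vector equals the norm of that mean, so it drops below 1 as soon as the sentences disagree.

```python
import numpy as np


def sample_sentence_embeddings(n, dim, spread, rng):
    """Draw n synthetic unit-norm 'sentence' vectors around a shared direction.

    `spread` is a hypothetical knob: 0.0 makes all sentences identical,
    larger values scatter them more widely.
    """
    base = np.zeros(dim)
    base[0] = 1.0
    vecs = base + spread * rng.standard_normal((n, dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)


def mean_cosine_to_pooled(sentences):
    """Average cosine similarity between each sentence and the mean-pooled vector."""
    pooled = sentences.mean(axis=0)
    pooled_unit = pooled / np.linalg.norm(pooled)
    # Rows of `sentences` are unit-norm, so the dot product is the cosine.
    return float((sentences @ pooled_unit).mean())


rng = np.random.default_rng(0)
results = {}
for spread in (0.0, 0.1, 0.5):
    sents = sample_sentence_embeddings(n=8, dim=64, spread=spread, rng=rng)
    results[spread] = mean_cosine_to_pooled(sents)
    print(f"spread={spread:.1f}  mean cosine to pooled vector={results[spread]:.3f}")
```

With identical sentences the pooled vector coincides with each of them; as the spread grows, the pooled vector sits farther from every individual sentence, which is the toy analogue of a smoothed, less discriminative embedding. Real sentence embeddings from a Transformer encoder would replace the synthetic vectors here.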
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Domain Adaptation and Few-Shot Learning
