Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval

Hang Gao; Dimitris N. Metaxas

arXiv:2603.21437·cs.CL·March 24, 2026

Open Access

TL;DR

This paper investigates how semantic evolution within texts causes embedding collapse and retrieval issues in transformer models, offering a theoretical framework and a measurable indicator called semantic shift.

Contribution

It introduces the concept of semantic shift as a key factor in embedding pathology, providing a formal measure and empirical validation across models and datasets.

Findings

01 Semantic shift correlates with embedding concentration and retrieval performance.

02 Text length alone does not predict embedding degradation.

03 Semantic shift can diagnose when anisotropy harms retrieval.

Abstract

Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this…
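
The smoothing effect described in the abstract can be illustrated with a toy sketch. It is not the paper's formal measure or analysis, which the truncated abstract does not reproduce. The sketch assumes unit-norm sentence embeddings and plain mean pooling, and it uses hypothetical 2D "sentence" vectors spread over an angular range as a stand-in for semantic diversity. All function names are illustrative.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def mean_pool(vectors):
    """Average the vectors dimension by dimension (mean pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def toy_sentences(spread_deg, count=4):
    """Hypothetical unit-norm 2D 'sentence embeddings', evenly spaced
    over angles in [-spread_deg, +spread_deg]. A wider spread stands in
    for more semantic diversity within the text."""
    step = 2 * spread_deg / (count - 1)
    angles = [math.radians(-spread_deg + i * step) for i in range(count)]
    return [[math.cos(a), math.sin(a)] for a in angles]

def mean_alignment(vectors):
    """Average cosine similarity between the pooled vector and each sentence."""
    pooled = mean_pool(vectors)
    return sum(cosine(pooled, v) for v in vectors) / len(vectors)

def max_alignment(vectors):
    """Cosine similarity between the pooled vector and its closest sentence."""
    pooled = mean_pool(vectors)
    return max(cosine(pooled, v) for v in vectors)

for spread in (10, 40, 80):
    sents = toy_sentences(spread)
    print(spread, round(mean_alignment(sents), 4), round(max_alignment(sents), 4))
```

In this toy setting both the average and the best-case similarity between the pooled vector and the sentences fall as the spread widens. For unit-norm inputs the average alignment equals the norm of the mean-pooled vector, so a shorter pooled vector is a direct sign of more dispersed constituents.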

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Domain Adaptation and Few-Shot Learning