A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens
Zhijie Nie, Richong Zhang, Zhanyu Wu

TL;DR
This paper reveals that text embeddings from large language models inherently align with key tokens in the input, enabling efficient retrieval and offering new insights into semantic understanding.
Contribution
The study uncovers a universal phenomenon across LLM-based embedders where embeddings align with key tokens, and demonstrates practical applications like sparse retrieval methods.
Findings
Relative to the LLM backbone, the embedding space changes mainly along the first principal component.
Adjusting the first principal component aligns embeddings with key tokens.
Sparse retrieval achieves 80% of dense retrieval performance.
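As a rough illustration of how aligned tokens could drive sparse retrieval, the sketch below maps each embedding to its top-k tokens and ranks documents by token overlap with the query. It is a toy example under stated assumptions, not the paper's method: the identity `unembed` matrix, the five-word vocabulary, the helper names `aligned_token_ids` and `sparse_score`, and the overlap-count scoring rule are all hypothetical stand-ins.

```python
import numpy as np

def aligned_token_ids(embedding, unembed, k):
    """Ids of the k tokens whose unembedding rows score highest against the embedding."""
    logits = unembed @ embedding              # shape: (vocab_size,)
    return set(np.argsort(-logits)[:k].tolist())

def sparse_score(query_ids, doc_ids):
    """Number of shared aligned tokens (an assumed scoring rule, chosen for simplicity)."""
    return len(query_ids & doc_ids)

# Toy setup (assumption): identity unembedding, so each dimension maps to one token.
vocab = ["the", "cat", "sat", "mat", "dog"]
unembed = np.eye(len(vocab))

docs = {
    "d1": np.array([0.1, 0.9, 0.2, 0.7, 0.0]),  # strongest on "cat" and "mat"
    "d2": np.array([0.1, 0.0, 0.3, 0.2, 0.9]),  # strongest on "dog" and "sat"
}
query = np.array([0.0, 0.8, 0.1, 0.6, 0.2])     # strongest on "cat" and "mat"

k = 2
doc_tokens = {name: aligned_token_ids(vec, unembed, k) for name, vec in docs.items()}
query_tokens = aligned_token_ids(query, unembed, k)

# Rank documents by how many aligned tokens they share with the query.
ranking = sorted(docs, key=lambda name: sparse_score(query_tokens, doc_tokens[name]),
                 reverse=True)
```

In a real setting the unembedding matrix would presumably come from the backbone LLM, and the token sets could be stored in an inverted index so that matching avoids dense vector comparisons.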
Abstract
Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval and semantic textual similarity. In this work, we report an interesting finding: when a text is fed into an LLM-based embedder, the resulting text embedding can be aligned with the key tokens in the input text. We first analyze this phenomenon in depth on eight LLM-based embedders and show that it is universal and is not affected by model architecture, training strategy, or embedding method. With a deeper analysis, we find that the main change in embedding space between these embedders and their LLM backbones lies in the first principal component. By adjusting the first principal component, we can align text embeddings with the key tokens. Finally, we give several examples to demonstrate the vast application potential of this finding: (1) we…
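One way to picture the principal-component claim is to measure how much of the shift between two embedding spaces falls on each principal component of the original space. The sketch below does this on synthetic data. It is only an assumed analysis procedure, not the authors' protocol: the random `backbone` matrix, the constructed `embedder` matrix, and the helper names are hypothetical.

```python
import numpy as np

def principal_components(X):
    """Rows are unit-norm principal directions of X, ordered by decreasing variance."""
    centered = X - X.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt

def shift_share_per_component(before, after, components):
    """Fraction of the squared mean shift (after - before) lying on each component."""
    mean_shift = (after - before).mean(axis=0)
    energy = (components @ mean_shift) ** 2
    return energy / energy.sum()

# Synthetic stand-in for backbone embeddings: 200 texts, 4 dimensions,
# with most variance on the first axis.
rng = np.random.default_rng(0)
scales = np.array([3.0, 1.0, 0.5, 0.2])
backbone = rng.normal(size=(200, 4)) * scales
components = principal_components(backbone)

# Assumption for illustration: the "embedder" differs from the backbone
# only by a shift along the first principal component.
embedder = backbone + 2.0 * components[0]

share = shift_share_per_component(backbone, embedder, components)
```

Because the shift was constructed to lie on the first component, `share[0]` comes out close to 1 here. With real embeddings, the same measurement would show whether the first component actually dominates.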
Taxonomy
Topics: Law, AI, and Intellectual Property · Artificial Intelligence in Law
Methods: ALIGN
