Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models

Kexin Chen; Dongxia Wang; Yi Liu; Haonan Zhang; Wenhai Wang

arXiv:2507.18171·cs.CL·July 25, 2025

TL;DR

This paper identifies and analyzes 'sticky tokens' in Transformer-based text embedding models, introduces an efficient detection method, and demonstrates their negative impact on downstream NLP task performance.

Contribution

The paper formally defines sticky tokens, presents the Sticky Token Detector (STD), and analyzes multiple models to show where sticky tokens originate and how they affect performance.

Findings

01. 868 sticky tokens identified in 40 checkpoints across 14 model families

02. Sticky tokens cause up to 50% performance degradation

03. Sticky tokens often originate from special or fragmented subwords

Abstract

Despite the widespread use of Transformer-based text embedding models in NLP tasks, surprising 'sticky tokens' can undermine the reliability of embeddings. These tokens, when repeatedly inserted into sentences, pull sentence similarity toward a certain value, disrupting the normal distribution of embedding distances and degrading downstream performance. In this paper, we systematically investigate such anomalous tokens, formally defining them and introducing an efficient detection method, Sticky Token Detector (STD), based on sentence and token filtering. Applying STD to 40 checkpoints across 14 model families, we discover a total of 868 sticky tokens. Our analysis reveals that these tokens often originate from special or unused entries in the vocabulary, as well as fragmented subwords from multilingual corpora. Notably, their presence does not strictly correlate with model size or…
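The "pull toward a certain value" behavior described in the abstract can be illustrated with a small toy sketch. This is not the paper's STD method and uses no real embedding model: the token vectors in `TOY_TABLE`, the mean-pooling embedder, and the helper `mean_gap` are all hypothetical, chosen only to show how one could measure whether repeatedly appending a token moves pairwise similarities closer to their average.

```python
import math

# Hypothetical 2-D token vectors (assumption: not taken from any real model).
TOY_TABLE = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "STICKY": [10.0, 10.0],   # large-norm vector that dominates mean pooling
    "plain": [0.1, -0.1],     # small-norm vector with little influence
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def embed(sentence, table):
    """Toy sentence embedding: mean of the whitespace-token vectors."""
    vecs = [table[tok] for tok in sentence.split()]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def mean_gap(pairs, table, token=None, repeats=0):
    """Average distance between pair similarities and the unperturbed mean.

    The reference value is the mean similarity of the original pairs.
    If `token` is given, it is appended `repeats` times to the second
    sentence of every pair before similarities are recomputed.
    """
    original = [cosine(embed(s1, table), embed(s2, table)) for s1, s2 in pairs]
    reference = sum(original) / len(original)
    suffix = (" " + token) * repeats if token else ""
    perturbed = [
        cosine(embed(s1, table), embed(s2 + suffix, table)) for s1, s2 in pairs
    ]
    return sum(abs(sim - reference) for sim in perturbed) / len(perturbed)

PAIRS = [("a", "a"), ("a", "b")]  # one identical pair, one orthogonal pair

baseline_gap = mean_gap(PAIRS, TOY_TABLE)
sticky_gap = mean_gap(PAIRS, TOY_TABLE, token="STICKY", repeats=3)
plain_gap = mean_gap(PAIRS, TOY_TABLE, token="plain", repeats=3)
```

In this toy setup, a smaller gap after insertion means the similarities have collapsed toward the reference mean, so comparing `sticky_gap` against `baseline_gap` and `plain_gap` gives a rough sense of how strongly a token "sticks". A real check would replace `embed` with an actual embedding model and use many sentence pairs.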

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling