Prediction hubs are context-informed frequent tokens in LLMs

Beatrix M. G. Nielsen; Iuri Macocco; Marco Baroni

arXiv:2502.10201·cs.CL·October 20, 2025

Open Access

TL;DR

This paper investigates hubness in large language models. It finds that the hubs arising in next-token prediction are context-informed frequent tokens rather than nuisance hubs, whereas comparing representations with other distance measures can introduce nuisance hubs that affect analysis and applications.

Contribution

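The paper proves that the comparison between context and unembedding vectors, which LLMs use to determine continuation probabilities, is not characterized by the concentration of distances that typically causes nuisance hubness. It also empirically distinguishes hubs that do not constitute a disturbance from nuisance hubs in high-dimensional LLM representations.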

Findings

01. Context-based hubs are not nuisance hubs in LLMs.

02. Using Euclidean or cosine distance can lead to nuisance hubs.

03. Nuisance hubs can be mitigated with appropriate techniques.

Abstract

Hubness, the tendency for a few points to be among the nearest neighbours of a disproportionate number of other points, commonly arises when applying standard distance measures to high-dimensional data, often negatively impacting distance-based analysis. As autoregressive large language models (LLMs) operate on high-dimensional representations, we ask whether they are also affected by hubness. We first prove that the only large-scale representation comparison operation performed by LLMs, namely that between context and unembedding vectors to determine continuation probabilities, is not characterized by the concentration of distances phenomenon that typically causes the appearance of nuisance hubness. We then empirically show that this comparison still leads to a high degree of hubness, but the hubs in this case do not constitute a disturbance. They are rather the result of…
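The definition of hubness at the start of the abstract can be made concrete with a small sketch. The code below counts how often each point appears among the k nearest neighbours of a set of queries (the "k-occurrence") and summarizes the resulting distribution by its skewness, a summary commonly used in the hubness literature. This is an illustration on random Gaussian data, not the authors' code: the function names `k_occurrence` and `skewness`, the array sizes, and the choice of k are all assumptions made for the example, and the abstract does not state which hubness measure the paper itself uses.

```python
import numpy as np


def k_occurrence(queries, points, k=10, metric="euclidean"):
    """Count how often each point is among the k nearest neighbours of the queries.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if metric == "euclidean":
        # Squared Euclidean distances, shape (n_queries, n_points).
        scores = (
            (queries ** 2).sum(axis=1, keepdims=True)
            - 2.0 * queries @ points.T
            + (points ** 2).sum(axis=1)
        )
    elif metric == "dot":
        # A larger dot product means "closer", so negate to rank ascending.
        scores = -(queries @ points.T)
    else:
        raise ValueError(f"unknown metric: {metric}")
    # Indices of the k smallest scores in each row (order within the k is arbitrary).
    neighbours = np.argpartition(scores, k - 1, axis=1)[:, :k]
    return np.bincount(neighbours.ravel(), minlength=points.shape[0])


def skewness(counts):
    """Third standardized moment of the k-occurrence distribution."""
    c = np.asarray(counts).astype(float)
    std = c.std()
    if std == 0:
        return 0.0
    return float(((c - c.mean()) ** 3).mean() / std ** 3)


# Toy data: random vectors standing in for "context" and "unembedding" vectors.
rng = np.random.default_rng(0)
points = rng.standard_normal((500, 64))
queries = rng.standard_normal((300, 64))

counts_euclid = k_occurrence(queries, points, k=10, metric="euclidean")
counts_dot = k_occurrence(queries, points, k=10, metric="dot")

print("max k-occurrence (euclidean):", counts_euclid.max())
print("skewness (euclidean):", skewness(counts_euclid))
print("max k-occurrence (dot):", counts_dot.max())
print("skewness (dot):", skewness(counts_dot))
```

A strongly right-skewed k-occurrence distribution means that a few points are neighbours of many queries, which is what the abstract calls hubness. Whether such hubs are a nuisance or carry useful information cannot be read off the skewness alone.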


Taxonomy

Topics: Semantic Web and Ontologies