Prediction hubs are context-informed frequent tokens in LLMs
Beatrix M. G. Nielsen, Iuri Macocco, Marco Baroni

TL;DR
This paper investigates hubness in large language models. It finds that the hubs arising in next-token prediction are context-informed frequent tokens rather than nuisance hubs, whereas comparing representations with other distance measures can introduce nuisance hubs that distort distance-based analysis and applications.
Contribution
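The paper proves that the comparison between context and unembedding vectors, which LLMs use to determine continuation probabilities, is not subject to the concentration of distances that typically produces nuisance hubness. It also empirically distinguishes beneficial hubs from nuisance hubs in high-dimensional LLM representations.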
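The comparison at issue is the standard LLM output step: a dot product between a context vector and every unembedding vector, followed by a softmax over the vocabulary. The following is a minimal sketch of that step and of counting how often each token appears among the top-k predictions. The function name `top_k_prediction_counts`, the array shapes, and the random placeholder data are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def top_k_prediction_counts(contexts, unembedding, k=10):
    """Return softmax probabilities and, per token, how often it is in the top-k."""
    # Logits: dot product between each context vector and each unembedding vector.
    logits = contexts @ unembedding.T              # shape (n_contexts, vocab_size)
    # Softmax over the vocabulary, shifted by the row maximum for stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Indices of the k most probable tokens for every context (unordered).
    top_k = np.argpartition(-probs, k - 1, axis=1)[:, :k]
    # Tokens with very large counts are candidate prediction hubs.
    counts = np.bincount(top_k.ravel(), minlength=unembedding.shape[0])
    return probs, counts

# Placeholder data (assumption): 200 contexts, 500-token vocabulary, 64 dimensions.
rng = np.random.default_rng(0)
contexts = rng.normal(size=(200, 64))
unembedding = rng.normal(size=(500, 64))
probs, counts = top_k_prediction_counts(contexts, unembedding, k=10)
```

With real model activations in place of the random arrays, tokens whose count is far above the average would be the ones to inspect as potential hubs.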
Findings
Context-based hubs are not nuisance hubs in LLMs.
Using Euclidean or cosine distance can lead to nuisance hubs.
Nuisance hubs can be mitigated with appropriate techniques.
Abstract
Hubness, the tendency for a few points to be among the nearest neighbours of a disproportionate number of other points, commonly arises when applying standard distance measures to high-dimensional data, often negatively impacting distance-based analysis. As autoregressive large language models (LLMs) operate on high-dimensional representations, we ask whether they are also affected by hubness. We first prove that the only large-scale representation comparison operation performed by LLMs, namely that between context and unembedding vectors to determine continuation probabilities, is not characterized by the concentration of distances phenomenon that typically causes the appearance of nuisance hubness. We then empirically show that this comparison still leads to a high degree of hubness, but the hubs in this case do not constitute a disturbance. They are rather the result of…
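The definition of hubness given in the abstract can be made concrete with k-occurrence counts: for each point, the number of other points that list it among their k nearest neighbours. The sketch below computes these counts under Euclidean distance and summarises them with the skewness of the count distribution, a commonly used hubness indicator. The helper names, the Gaussian placeholder data, and the choice of skewness as the summary statistic are assumptions for illustration; the excerpt above does not state which measure the paper uses.

```python
import numpy as np

def k_occurrence(points, k=10):
    """For each point, count how many other points have it among their k nearest neighbours."""
    sq_norms = (points ** 2).sum(axis=1)
    # Pairwise squared Euclidean distances via the expansion |a|^2 + |b|^2 - 2ab.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    # A point is not its own neighbour.
    np.fill_diagonal(d2, np.inf)
    # Indices of the k nearest neighbours of every point (unordered).
    neighbours = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return np.bincount(neighbours.ravel(), minlength=points.shape[0])

def skewness(x):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5

# Placeholder data (assumption): 500 Gaussian points in 100 dimensions.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 100))
n_k = k_occurrence(points, k=10)
hub_skew = skewness(n_k)
```

A strongly right-skewed count distribution indicates that a few points collect a disproportionate share of neighbour slots, which is the pattern the abstract describes.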
Taxonomy
Topics: Semantic Web and Ontologies
