Hubness Reduction Improves Sentence-BERT Semantic Spaces
Beatrix M. G. Nielsen, Lars Kai Hansen

TL;DR
This paper shows that reducing hubness in Sentence-BERT semantic spaces improves the quality of the text embeddings, as measured by lower hubness scores and a lower error rate for a neighbourhood-based classifier.
Contribution
The study identifies hubness as a key issue in Sentence-BERT embeddings and shows that applying combined hubness reduction methods enhances semantic space quality.
Findings
Hubness causes asymmetric neighborhood relations in embeddings.
Applying hubness reduction decreases error rates in neighborhood-based classifiers.
Combined hubness reduction methods can reduce hubness by about 75%.
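To make the notion of asymmetric neighbourhoods concrete, here is a minimal sketch of how hubness can be measured. It counts how often each point appears among the k nearest neighbours of the other points (the k-occurrence) and reports the skewness of that distribution, which is one commonly used hubness score. The function names, the choice of cosine similarity, k = 10, and the synthetic Gaussian data are all assumptions for illustration; they are not taken from the paper.

```python
import numpy as np

def k_occurrence(embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """Count how often each point is among the k nearest neighbours of the others."""
    # Cosine similarity: normalise rows, then take dot products.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # a point is not its own neighbour
    # Indices of the k most similar points for every row (unordered within the k).
    knn = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    return np.bincount(knn.ravel(), minlength=len(embeddings))

def k_skewness(counts: np.ndarray) -> float:
    """Skewness (third standardised moment) of the k-occurrence distribution."""
    centred = counts - counts.mean()
    return float((centred ** 3).mean() / (centred ** 2).mean() ** 1.5)

# Synthetic stand-in for sentence embeddings (assumption, not real data).
rng = np.random.default_rng(0)
points = rng.standard_normal((500, 300))
counts = k_occurrence(points, k=10)
print("k-skewness:", k_skewness(counts))
print("anti-hubs (never a neighbour):", int((counts == 0).sum()))
```

Every point has exactly k outgoing neighbour links, so the mean k-occurrence is always k. Hubness shows up as a long right tail: a few points collect many incoming links while many collect few or none, which gives a positive skewness.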
Abstract
Semantic representations of text, i.e. representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighbourhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. We find that when hubness is high, we can reduce error rate and hubness…
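The error rate of a neighbourhood-based classifier mentioned in the abstract can be sketched as a leave-one-out k-nearest-neighbour vote. The version below is a minimal illustration under stated assumptions: cosine similarity, majority voting with ties broken towards the lowest class index, integer labels starting at 0, and a toy two-cluster dataset. The name `knn_error_rate` is hypothetical, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def knn_error_rate(embeddings: np.ndarray, labels: np.ndarray, k: int = 10) -> float:
    """Leave-one-out error rate of a majority-vote k-nearest-neighbour classifier."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude each point from its own neighbourhood
    knn = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    neighbour_labels = labels[knn]  # shape (n_points, k)
    n_classes = int(labels.max()) + 1
    # Count votes per class among each point's neighbours.
    votes = np.stack(
        [(neighbour_labels == c).sum(axis=1) for c in range(n_classes)], axis=1
    )
    predicted = votes.argmax(axis=1)  # ties go to the lowest class index
    return float((predicted != labels).mean())

# Toy data (assumption): two well-separated clusters with integer labels 0 and 1.
rng = np.random.default_rng(0)
toy_labels = np.repeat([0, 1], 50)
centres = np.array([[5.0, 0.0], [-5.0, 0.0]])
toy_points = centres[toy_labels] + 0.1 * rng.standard_normal((100, 2))
error = knn_error_rate(toy_points, toy_labels, k=10)
print("leave-one-out kNN error rate:", error)
```

Computing this score before and after a hubness reduction step, on the same labelled texts, gives the kind of comparison the abstract describes.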
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
