Beyond the Geometric Curse: High-Dimensional N-Gram Hashing for Dense Retrieval
Sangeet Sharma

TL;DR
This paper introduces NUMEN, a training-free, high-dimensional hashing method for dense retrieval that surpasses traditional sparse methods like BM25 by removing the dimensionality bottleneck.
Contribution
NUMEN demonstrates that eliminating training and using deterministic hashing enables a dense retrieval model to outperform the sparse BM25 baseline on the LIMIT benchmark.
Findings
On the LIMIT benchmark, NUMEN achieves 93.90% Recall@100 at 32,768 dimensions.
This surpasses the sparse BM25 baseline of 93.6% Recall@100.
Removing the embedding bottleneck improves dense retrieval performance.
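For readers unfamiliar with the metric, here is a minimal sketch of how Recall@k is commonly computed: the fraction of a query's relevant documents that appear among the top k retrieved results, averaged over queries. The function names and the toy document IDs are hypothetical, and the exact evaluation protocol used for the reported numbers is not given in this summary.

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of relevant documents that appear in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention assumed here: no relevant docs counts as zero recall
    hits = len(relevant.intersection(ranked_ids[:k]))
    return hits / len(relevant)


def mean_recall_at_k(runs, k=100):
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Toy example with made-up document IDs.
ranking = ["d3", "d1", "d7", "d2"]
relevant = ["d1", "d2"]
score_top2 = recall_at_k(ranking, relevant, k=2)  # only d1 is in the top 2
score_top4 = recall_at_k(ranking, relevant, k=4)  # both d1 and d2 are in the top 4
```

Under this definition, 93.90% Recall@100 would mean that, on average, about 94 of every 100 relevant documents appear within the first 100 results.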
Abstract
Why do even the most powerful 7B-parameter embedding models struggle with simple retrieval tasks that the decades-old BM25 handles with ease? Recent theory suggests that this happens because of a dimensionality bottleneck, which arises when we force infinite linguistic nuances into small, fixed-length learned vectors. We developed NUMEN to break this bottleneck by removing the learning process entirely. Instead of training heavy layers to map text to a constrained space, NUMEN uses deterministic character hashing to project language directly onto high-dimensional vectors. This approach requires no training, supports an unlimited vocabulary, and allows the geometric capacity to scale as needed. On the LIMIT benchmark, NUMEN achieves 93.90% Recall@100 at 32,768 dimensions. This makes it the first dense retrieval model to officially surpass the sparse BM25 baseline (93.6%). Our findings show…
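The abstract describes the mechanism only at a high level, so the following is a minimal sketch of training-free character n-gram hashing in general, not NUMEN's actual implementation. It assumes character trigrams, CRC32 as the hash function, raw counts as weights, and L2 normalization with cosine scoring; none of these choices is confirmed by the text above, and all function names are hypothetical.

```python
import math
import zlib


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hash_embed(text, dim=32768, n=3):
    """Map text to an L2-normalized vector of length `dim` from hashed n-gram counts."""
    vec = [0.0] * dim
    for gram in char_ngrams(text, n):
        # zlib.crc32 is deterministic across runs, unlike Python's built-in hash() on str.
        idx = zlib.crc32(gram.encode("utf-8")) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Texts shorter than n characters yield no n-grams and stay as the zero vector.
    return [v / norm for v in vec] if norm > 0 else vec


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


# Toy usage: the query shares trigrams with the first document only.
query = hash_embed("apples")
doc_a = hash_embed("Jon likes apples")
doc_b = hash_embed("Ana likes quokkas")
score_a = cosine(query, doc_a)
score_b = cosine(query, doc_b)
```

Because any string hashes to some index, no vocabulary has to be fixed in advance, and raising `dim` reduces collisions between distinct n-grams without any retraining. A practical version would likely store the vectors in a sparse format, since most of the 32,768 entries are zero for short texts.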
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Face recognition and analysis · Advanced Neural Network Applications
