Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix
Santosh Kumar Radha, Oktay Goktas

TL;DR
This paper introduces a simple modification to BM25 that replaces the outer logarithm of its RSJ-odds IDF with a q-logarithm, nearly doubling oracle-tuned NDCG@10 on CoIR CodeSearchNet Go without increasing index or query time.
Contribution
It proposes a drop-in fix for BM25 based on a q-logarithm transform of the RSJ odds, improving retrieval accuracy when tokenization is fixed and generic.
Findings
Oracle-tuned NDCG@10 nearly doubles on the CoIR CodeSearchNet Go dataset, from 0.2575 to 0.4874 (+89.3% relative).
The effect is graded across programming languages.
No increase in index or query latency.
Abstract
In retrieval-augmented coding, failures often begin when the relevant file is absent from the retrieved context. Under frozen generic tokenization, where a BM25 index has been built by a search system whose analyzer the practitioner does not control, this failure is routine: BM25's logarithmic RSJ-odds IDF under-separates the identifier tail that distinguishes one function from another. We replace the outer logarithm of the Robertson-Spärck-Jones odds with a q-logarithm. At q=1 the transform recovers BM25 exactly by L'Hôpital's rule, and for q<1 it is a Box-Cox transform of the RSJ odds with lambda = 1-q. On CoIR CodeSearchNet Go (182K documents), oracle-tuned NDCG@10 rises from 0.2575 to 0.4874 (absolute +0.2299; +89.3% relative; zero sign reversals in 10,000 paired-bootstrap resamples, reported as p <= 10^-4). The effect is graded across code languages and is near-zero on BEIR…
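The transform described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the classic RSJ odds with 0.5 smoothing, (N - n + 0.5) / (n + 0.5), and the standard q-logarithm (x^(1-q) - 1) / (1-q). The function names `q_log`, `rsj_odds`, and `q_idf` are hypothetical, and the document counts in the usage lines are made up for illustration. How q is chosen adaptively is not covered here.

```python
import math


def q_log(x: float, q: float) -> float:
    """q-logarithm of x > 0; reduces to the natural log in the limit q -> 1."""
    if x <= 0:
        raise ValueError("q_log is defined for x > 0")
    lam = 1.0 - q  # Box-Cox exponent, lambda = 1 - q
    if abs(lam) < 1e-12:
        return math.log(x)  # the q = 1 limit (plain BM25 case)
    return (x ** lam - 1.0) / lam


def rsj_odds(n_docs: int, doc_freq: int) -> float:
    """Robertson-Sparck-Jones odds with the usual 0.5 smoothing (assumed form)."""
    return (n_docs - doc_freq + 0.5) / (doc_freq + 0.5)


def q_idf(n_docs: int, doc_freq: int, q: float = 1.0) -> float:
    """IDF with the outer log of the RSJ odds replaced by a q-logarithm."""
    return q_log(rsj_odds(n_docs, doc_freq), q)


# Illustrative counts only: a rare identifier vs. a more common token.
N = 182_000
rare, common = 2, 200
ratio_bm25 = q_idf(N, rare, q=1.0) / q_idf(N, common, q=1.0)
ratio_qlog = q_idf(N, rare, q=0.5) / q_idf(N, common, q=0.5)
print(f"rare/common IDF ratio at q=1.0: {ratio_bm25:.2f}")
print(f"rare/common IDF ratio at q=0.5: {ratio_qlog:.2f}")
```

Under these assumptions, lowering q below 1 widens the gap between the IDF of rare and common terms, which is one way to read the abstract's claim that the plain logarithm under-separates the identifier tail. Because only the per-term IDF weight changes, the rest of the BM25 scoring formula is left untouched.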
