Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix

Santosh Kumar Radha; Oktay Goktas

arXiv:2605.18561·cs.IR·May 19, 2026

Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix

Santosh Kumar Radha, Oktay Goktas

PDF

TL;DR

This paper introduces a simple modification to BM25, replacing its logarithm with a q-logarithm, significantly improving code retrieval performance without increasing index or query time.

Contribution

It proposes a drop-in fix for BM25 using q-logarithm transform, enhancing retrieval accuracy in fixed tokenization scenarios.

Findings

01

NDCG@10 nearly doubles on CoIR CodeSearchNet Go dataset.

02

Effect is consistent across multiple programming languages.

03

No increase in index or query latency.

Abstract

In retrieval-augmented coding, failures often begin when the relevant file is absent from the retrieved context. Under frozen generic tokenization, where a BM25 index has been built by a search system whose analyzer the practitioner does not control, this failure is routine: BM25's logarithmic RSJ-odds IDF under-separates the identifier tail that distinguishes one function from another. We replace the outer logarithm of the Robertson-Sp\"arck-Jones odds with a q-logarithm. At q=1 the transform recovers BM25 exactly by L'H\^opital's rule, and for q<1 it is a Box-Cox transform of the RSJ odds with lambda = 1-q. On CoIR CodeSearchNet Go (182K documents), oracle-tuned NDCG@10 rises from 0.2575 to 0.4874 (absolute +0.2299; +89.3% relative; zero sign reversals in 10,000 paired-bootstrap resamples, reported as p <= 10^-4). The effect is graded across code languages and is near-zero on BEIR…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.