TL;DR
This paper introduces \textsc{CoREB}, a code search benchmark, together with a fine-tuned reranker. It addresses limitations of existing datasets (data contamination, label noise, and binary relevance) and evaluates models across three tasks and five programming languages.
Contribution
It presents a contamination-limited, multitask benchmark and a fine-tuned reranker that together cover the full code search pipeline rather than first-stage retrieval alone.
Findings
Code-specialized embeddings outperform general encoders in code-to-code retrieval.
Short keyword queries significantly reduce model effectiveness.
The fine-tuned \textsc{CoREB-Reranker} achieves consistent improvements across tasks.
Abstract
Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce \textsc{CoREB}, a contamination-limited, multitask \underline{co}de \underline{r}etrieval and r\underline{e}ranking \underline{b}enchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. \textsc{CoREB} is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: \circone code-specialised…
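To illustrate why graded relevance judgments matter when evaluating a reranking stage, here is a minimal sketch of nDCG@k, a metric commonly used with graded labels. The abstract as shown here does not say which metrics \textsc{CoREB} reports, so the choice of nDCG, the exponential gain, the label scale (2 = exact match, 1 = partial, 0 = irrelevant), and the example rankings are all assumptions made for illustration. The sketch also assumes every relevant candidate already appears in the ranked list, so the ideal ordering is computed from that list alone.

```python
import math


def dcg(gains):
    """Discounted cumulative gain over graded labels in ranked order."""
    # Exponential gain (2^g - 1) with a log2 position discount; rank is 0-based.
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_gains, k):
    """nDCG@k: DCG of the top-k results divided by the DCG of the ideal top-k."""
    ideal = sorted(ranked_gains, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(ranked_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical labels for one query, in the order each stage returned them.
first_stage = [0, 1, 2, 0]  # retriever puts the exact match third
reranked = [2, 1, 0, 0]     # reranker moves it to the top

score_before = ndcg_at_k(first_stage, k=4)
score_after = ndcg_at_k(reranked, k=4)
print(f"nDCG@4 before reranking: {score_before:.3f}")
print(f"nDCG@4 after reranking:  {score_after:.3f}")
```

Under binary relevance the labels 1 and 2 would collapse into a single "relevant" class, so a ranking that places the partial match above the exact match would score the same as the ideal one. Graded labels let the metric tell those two rankings apart.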
