Repository-level Code Search with Neural Retrieval Methods
Siddharth Gandhi, Luyu Gao, Jamie Callan

TL;DR
This paper introduces a neural reranking system for repository-level code search that combines BM25 retrieval with CodeBERT to significantly improve relevance ranking using commit histories.
Contribution
It presents a novel multi-stage reranking approach leveraging commit messages and source code, achieving substantial performance gains over baseline methods.
Findings
Up to 80% improvement in MAP, MRR, and P@1 over baseline methods.
Effective use of commit histories for code relevance.
Demonstrated on a new dataset from 7 open-source repositories.
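For readers unfamiliar with the three metrics named above, the following is a minimal sketch of how MAP, MRR, and P@1 are conventionally computed for file-ranking tasks. The function names and the data layout (a ranked list of file paths per query plus a set of relevant files) are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Dict, List, Set


def precision_at_1(ranked: List[str], relevant: Set[str]) -> float:
    """1.0 if the top-ranked file is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """Inverse of the rank of the first relevant file (0.0 if none is found)."""
    for rank, path in enumerate(ranked, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision values taken at each rank where a relevant file appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, path in enumerate(ranked, start=1):
        if path in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> Dict[str, float]:
    """Average each metric over all queries in `runs` (hypothetical layout:
    query id -> ranked file paths; `qrels`: query id -> relevant file paths)."""
    n = len(runs)
    if n == 0:
        return {"MAP": 0.0, "MRR": 0.0, "P@1": 0.0}
    return {
        "MAP": sum(average_precision(r, qrels.get(q, set())) for q, r in runs.items()) / n,
        "MRR": sum(reciprocal_rank(r, qrels.get(q, set())) for q, r in runs.items()) / n,
        "P@1": sum(precision_at_1(r, qrels.get(q, set())) for q, r in runs.items()) / n,
    }
```

A relative improvement such as the one reported would then be computed as (new - baseline) / baseline on each averaged metric.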
Abstract
This paper presents a multi-stage reranking system for repository-level code search, which leverages the vastly available commit histories of large open-source repositories to aid in bug fixing. We define the task of repository-level code search as retrieving the set of files from the current state of a code repository that are most relevant to addressing a user's question or bug. The proposed approach combines BM25-based retrieval over commit messages with neural reranking using CodeBERT to identify the most pertinent files. By learning patterns from diverse repositories and their commit histories, the system can surface relevant files for the task at hand. The system leverages both commit messages and source code for relevance matching, and is evaluated in both normal and oracle settings. Experiments on a new dataset created from 7 popular open-source repositories demonstrate…
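The first stage described in the abstract, BM25 retrieval over commit messages that surfaces candidate files, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation. The whitespace tokenizer, the `k1` and `b` defaults, the `(message, files)` commit layout, and the choice to score a file by the best-matching commit that touched it are all hypothetical. In the system the abstract describes, a neural reranker (CodeBERT) would then reorder these candidates; that stage is omitted here.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with the standard BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for d in tokenized:
        df.update(set(d))

    query_terms = set(tokenize(query))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def retrieve_files(
    query: str,
    commits: List[Tuple[str, List[str]]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Rank files by the best BM25 score among the commits that modified them.

    `commits` is a hypothetical layout: (commit message, files changed).
    Max-aggregation is an assumption made for this sketch.
    """
    messages = [message for message, _ in commits]
    scores = bm25_scores(query, messages)

    file_scores: Dict[str, float] = defaultdict(float)
    for (_, files), score in zip(commits, scores):
        for path in files:
            file_scores[path] = max(file_scores[path], score)

    # Sort by descending score, breaking ties by path for determinism.
    ranked = sorted(file_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    history = [
        ("fix null pointer in parser", ["src/parser.py"]),
        ("update docs for release", ["README.md"]),
        ("parser handles empty input", ["src/parser.py", "tests/test_parser.py"]),
    ]
    for path, score in retrieve_files("parser crash on empty input", history):
        print(f"{score:.3f}  {path}")
```

Scoring commit messages rather than file contents lets the query match the natural-language description of past changes, which is the signal the abstract attributes to commit histories.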
Taxonomy
Topics: Natural Language Processing Techniques · Software Engineering Research
Methods: CodeBERT · Sparse Evolutionary Training
