TL;DR
This paper systematically evaluates decoder-only large language models for code search, revealing their strengths, limitations, and key factors influencing performance, such as model size and training data composition.
Contribution
It provides the first large-scale comparison of decoder-only LLMs for code search, highlighting their potential and practical considerations for deployment.
Findings
Fine-tuned decoder-only models, CodeGemma in particular, outperform encoder-only models such as UniXcoder, reaching a 40.4% higher MAP on the CoSQA benchmark.
Model size has a non-monotonic effect on performance: mid-sized models often perform better.
Multilingual training data improves generalization, while language-specific noise can hinder performance.
Abstract
Code search is essential for code reuse, allowing developers to efficiently locate relevant code snippets. The advent of powerful decoder-only Large Language Models (LLMs) has revolutionized many code intelligence tasks. However, their effectiveness for the retrieval-based task of code search, particularly compared to established encoder-based models, remains underexplored. This paper addresses this gap by presenting a large-scale systematic evaluation of eleven decoder-only LLMs, analyzing their performance across zero-shot and fine-tuned settings. Our results show that fine-tuned decoder-only models, particularly CodeGemma, significantly outperform encoder-only models like UniXcoder, achieving a 40.4% higher Mean Average Precision (MAP) on the CoSQA benchmark. Our analysis further reveals two crucial nuances for practitioners: first, the relationship between model size and…
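For readers unfamiliar with the metric, Mean Average Precision (MAP) can be sketched as follows. This is a minimal illustration of the standard definition, not the paper's evaluation code. It assumes each query's ranked results are given as a 0/1 relevance list and that every relevant snippet appears somewhere in that list. The function names and the toy inputs are hypothetical, and `relative_gain_percent` assumes a figure such as "40.4% higher" is read as a relative improvement over the baseline.

```python
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """Average precision for one query.

    relevance[i] is 1 if the result at rank i + 1 is relevant, else 0.
    Assumption: all relevant items for the query appear in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[int]]) -> float:
    """Mean of the per-query average precision values."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


def relative_gain_percent(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


if __name__ == "__main__":
    # Toy rankings for two queries (made-up values, not from the paper).
    toy_rankings = [[1, 0, 1], [0, 1]]
    print("MAP:", mean_average_precision(toy_rankings))
    print("Relative gain (%):", relative_gain_percent(0.6, 0.5))
```

Because average precision rewards relevant snippets that appear near the top of the ranking, a model that retrieves the correct snippet at rank 1 scores higher than one that retrieves it at rank 2, even when both return it.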
