On the Challenges and Opportunities of Learned Sparse Retrieval for Code
Simon Lupart, Maxime Louis, Thibault Formal, Hervé Déjean, Stéphane Clinchant

TL;DR
This paper introduces SPLADE-Code, a family of learned sparse retrieval models for code that achieves state-of-the-art results among retrievers under 1B parameters and fast retrieval times, addressing challenges such as subword fragmentation and the semantic gap between natural-language queries and code.
Contribution
The paper presents SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code (600M-8B parameters), trained with a lightweight one-stage pipeline and reaching competitive performance.
Findings
SPLADE-Code achieves 75.4 on MTEB Code with under 1B parameters.
The 8B-parameter model attains 79.0 on MTEB Code, which is competitive with other retrievers at larger scales.
LSR enables sub-millisecond retrieval latency on large code collections.
Abstract
Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion…
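To make the idea of learned sparse retrieval concrete, here is a minimal sketch of how a sparse term-weight representation can be built and searched with an inverted index. It is an illustration only, not the SPLADE-Code implementation: the log-saturated ReLU with max pooling follows the original SPLADE formulation (whether SPLADE-Code pools the same way is an assumption), and the function names, toy logits, and document weights are all hypothetical.

```python
import math
from collections import defaultdict


def sparse_encode(token_logits):
    """Collapse per-token vocabulary logits into one sparse {term: weight} dict.

    token_logits: list of {term: logit} dicts, one per input token.
    Applies log(1 + ReLU(logit)) and max-pools over tokens (assumed pooling).
    Terms whose weight is zero are dropped, which is what keeps the vector sparse.
    """
    weights = {}
    for logits in token_logits:
        for term, logit in logits.items():
            w = math.log1p(max(logit, 0.0))
            if w > 0.0:
                weights[term] = max(weights.get(term, 0.0), w)
    return weights


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, w in weights.items():
            index[term].append((doc_id, w))
    return index


def search(index, query_weights, k=3):
    """Score documents by the sparse dot product, touching only matching postings."""
    scores = defaultdict(float)
    for term, qw in query_weights.items():
        for doc_id, dw in index.get(term, ()):
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical, hand-written document weights standing in for model output.
docs = {
    "read_file.py": {"open": 1.2, "file": 1.5, "read": 1.1},
    "sort_list.py": {"sort": 1.6, "list": 1.0},
}
index = build_inverted_index(docs)

# Hypothetical logits for a two-token query; "open" acts as an expansion term
# and the negative logit for "sort" is zeroed out by the ReLU.
query = sparse_encode([
    {"read": 2.0, "file": 0.5},
    {"file": 3.0, "open": 1.0, "sort": -1.0},
])
results = search(index, query)
```

Because only the posting lists of the query's non-zero terms are visited, the cost of a search depends on how sparse the representations are, which is why long code documents that activate many terms can hurt latency.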
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
