Towards A Generalist Code Embedding Model Based On Massive Data Synthesis
Chaofan Li, Jianlyu Chen, Yingxia Shao, Defu Lian, Zheng Liu

TL;DR
This paper introduces CodeR, a state-of-the-art code embedding model trained on CodeR-Pile, a large synthetic dataset, with a curriculum learning strategy called Annealing. It outperforms existing baselines on code retrieval and generalizes well out of domain.
Contribution
The paper presents CodeR, a code embedding model trained on CodeR-Pile, a synthetic dataset built with a novel data synthesis pipeline. Training uses Annealing, a curriculum learning strategy, to advance code retrieval capabilities.
Findings
Outperforms existing baselines on 16 code retrieval tasks
Exhibits strong out-of-domain generalization
Demonstrates effectiveness of synthetic data and curriculum learning
Abstract
Code embedding models attract increasing attention due to the widespread popularity of retrieval-augmented generation (RAG) in software development. These models are expected to capture the rich semantic relationships inherent to code, which differ significantly from those found in text. However, existing models remain severely limited due to the scarcity of high-quality training data. In this work, we introduce CodeR (Code Retrieval), a state-of-the-art embedding model for general-purpose code retrieval. The superior performance of CodeR is built upon CodeR-Pile, a large-scale synthetic dataset constructed under the DRU (Diversity, Reliability, Usability) principle via a novel data synthesis pipeline. To optimize training effectiveness, we propose Annealing, a curriculum learning strategy that enables effective knowledge transfer across heterogeneous…
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software Engineering Techniques and Practices
Methods: Softmax · Attention Is All You Need
