Towards A Generalist Code Embedding Model Based On Massive Data Synthesis

Chaofan Li; Jianlyu Chen; Yingxia Shao; Defu Lian; Zheng Liu

arXiv:2505.12697·cs.IR·May 20, 2025

TL;DR

This paper introduces CodeR, a state-of-the-art code embedding model trained on a large synthetic dataset using a novel curriculum learning strategy, significantly improving code retrieval performance and generalization.

Contribution

The paper presents CodeR, a code embedding model trained on CodeR-Pile, a large-scale synthetic dataset built with a novel data synthesis pipeline. Training uses Annealing, a curriculum learning strategy, and the resulting model advances code retrieval capabilities.

Findings

01 Outperforms existing baselines on 16 code retrieval tasks

02 Exhibits strong out-of-domain generalization

03 Demonstrates effectiveness of synthetic data and curriculum learning

Abstract

Code embedding models attract increasing attention due to the widespread popularity of retrieval-augmented generation (RAG) in software development. These models are expected to capture the rich semantic relationships inherent to code, which differ significantly from those found in text. However, existing models remain severely limited due to the scarcity of high-quality training data. In this work, we introduce CodeR (Code Retrieval), a state-of-the-art embedding model for general-purpose code retrieval. The superior performance of CodeR is built upon CodeR-Pile, a large-scale synthetic dataset constructed under the DRU (Diversity, Reliability, Usability) principle via a novel data synthesis pipeline. To optimize training effectiveness, we propose Annealing, a curriculum learning strategy that enables effective knowledge transfer across heterogeneous…
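As a rough illustration of what embedding-based code retrieval means in general, the sketch below ranks code snippets by cosine similarity to a natural-language query. It is not CodeR's method: `toy_embed` is a hypothetical stand-in that uses sparse token counts instead of a learned model, and `cosine`, `retrieve`, and the three-snippet `corpus` are likewise assumptions made for this example.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a learned embedding model.

    Returns sparse token counts rather than a dense learned vector.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, corpus, top_k=1):
    """Return indices of the top_k snippets most similar to the query."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(doc)), i) for i, doc in enumerate(corpus)]
    # Highest score first; ties broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_k]]


# Illustrative corpus, not taken from CodeR-Pile.
corpus = [
    "def reverse_string(s): return s[::-1]",
    "def read_file(path): return open(path).read()",
    "def add_numbers(a, b): return a + b",
]

top = retrieve("read file contents from path", corpus)
print(corpus[top[0]])
```

A learned model would replace `toy_embed` with dense vectors that capture semantics beyond shared tokens; the ranking step stays the same.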

Code & Models

Repositories

flagopen/flagembedding
PyTorch · Official

Models

BAAI/bge-code-v1
model · 12k dl · ♡ 48

Datasets

nebula2025/CodeR-Pile
dataset · 3.7k dl

Videos

Towards A Generalist Code Embedding Model Based On Massive Data Synthesis · SlidesLive

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Software Engineering Techniques and Practices

Methods: Softmax · Attention Is All You Need