Efficient Code Embeddings from Code Generation Models
Daria Kryvosheieva, Saba Sturua, Michael G\"unther, Scott Martens, Han Xiao

TL;DR
This paper introduces jina-code-embeddings, a new suite of code embedding models that leverage autoregressive pre-training on text and code to enable effective code retrieval, question-answering, and semantic similarity detection across languages.
Contribution
It presents a novel training approach using last-token pooling with autoregressive models, achieving state-of-the-art performance with smaller models.
Findings
State-of-the-art code retrieval accuracy
Effective cross-language code similarity detection
Successful application to technical question-answering
Abstract
jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
| Task type | Query prefix | Document prefix |
| NL2Code | "Find the most relevant code snippet | "Candidate code snippet:\n" |
| given the following query:\n" | ||
| TechQA | "Find the most relevant answer | "Candidate answer:\n" |
| given the following question:\n" | ||
| Code2Code | "Find an equivalent code snippet | "Candidate code snippet:\n" |
| given the following code snippet:\n" | ||
| Code2NL | "Find the most relevant comment | "Candidate comment:\n" |
| given the following code snippet:\n" | ||
| Code2Completion | "Find the most relevant completion | "Candidate completion:\n" |
| given the following start of code | ||
| snippet:\n" |
| Benchmark | JCE-0.5B | JCE-1.5B | JV4 | Qw3-0.6B | VC3 | GE-001 |
|---|---|---|---|---|---|---|
| CoSQA+ | 15.42% | 16.38% | 13.29% | 15.63% | 13.57% | 16.44% |
| CoSQA* | 39.25% | 35.10% | 29.99% | 37.75% | 34.11% | 51.94% |
| MBPP | 89.01% | 90.13% | 89.93% | 88.29% | 94.68% | 93.46% |
| COIR-CSN* | 85.73% | 86.45% | 84.03% | 84.78% | 89.35% | 81.06% |
| CSN* | 90.68% | 91.38% | 84.84% | 90.77% | 93.92% | 91.38% |
| Doc2Code | 95.98% | 96.34% | 91.46% | 94.77% | 97.18% | 96.54% |
| SWE-Bench | 83.00% | 86.33% | 81.00% | 76.12% | 87.02% | 87.40% |
| CES* | 83.25% | 84.43% | 72.75% | 64.21% | 80.30% | 81.69% |
| CP-FT | 63.00% | 65.06% | 45.93% | 38.50% | 59.24% | 61.18% |
| AppsR* | 84.17% | 86.63% | 78.32% | 75.22% | 93.77% | 95.70% |
| LeetCode | 57.86% | 59.075% | 59.11% | 58.23% | 58.89% | 58.40% |
| CodeChef | 94.03% | 96.89% | 87.98% | 84.29% | 99.18% | 99.55% |
| SynText2SQL* | 72.80% | 73.91% | 76.98% | 66.91% | 63.39% | 59.24% |
| Spider | 81.65% | 82.18% | 81.18% | 81.45% | 81.99% | 81.15% |
| WikiSQL | 98.31% | 98.02% | 96.06% | 96.04% | 95.71% | 90.94% |
| CF-MT* | 89.56% | 89.91% | 70.07% | 90.79% | 93.47% | 64.95% |
| CF-ST* | 85.73% | 86.18% | 85.47% | 86.43% | 90.56% | 85.70% |
| StackOQA* | 91.04% | 92.37% | 93.80% | 89.96% | 96.90% | 96.02% |
| DS-1000 | 59.77% | 62.88% | 64.11% | 61.19% | 69.49% | 70.10% |
| MLQuestions | 81.05% | 77.46% | 54.71% | 60.52% | 66.87% | 62.95% |
| CTOC* | 90.37% | 92.54% | 92.23% | 86.28% | 93.49% | 92.59% |
| CTODL* | 41.69% | 37.319% | 46.29% | 31.78% | 38.72% | 32.84% |
| CodeChefXLang | 99.70% | 99.44% | 92.82% | 90.94% | 99.13% | 99.79% |
| CSN-CC* | 90.41% | 91.12% | 83.69% | 91.41% | 90.09% | 84.69% |
| HumanEval | 96.77% | 98.41% | 96.74% | 94.84% | 99.77% | 98.90% |
| Overall AVG | 78.41% | 79.04% | 74.11% | 73.49% | 79.23% | 77.38% |
| MTEB Code AVG | 78.72% | 78.94% | 74.87% | 74.69% | 79.84% | 76.48% |
| Dataset | Task type |
|---|---|
| AppsRetrieval | NL2Code |
| CodeChefP2S | |
| CodeForcesP2S | |
| CodeSearchNet | |
| CommitPackFT | |
| CoSQA | |
| CoSQA+ | |
| Doc2Code | |
| LeetCodeP2S | |
| MBPP | |
| Spider | |
| SWE-Bench | |
| SyntheticText2SQL | |
| WikiSQL | |
| CodeFeedbackMT | TechQA |
| CodeFeedbackST | |
| DataScience | |
| GlaiveCodeAssistantV2 | |
| MLQuestions | |
| StackExchangeBody | |
| StackExchangePost | |
| StackExchangeTitle | |
| StackOverflowQA | |
| CodeChefS2S | Code2Code |
| CodeChefXLang | |
| CodeForcesS2S | |
| CodeTransOceanContest | |
| CodeTransOceanDL | |
| HackerEarth | |
| LeetCodeXLang | |
| SyntheticDLTrans | |
| COIRCodeSearchNetRetrieval | Code2NL |
| CodeSearchNetCCRetrieval | Code2Completion |
| Dataset | Task type |
|---|---|
| AppsRetrieval | NL2Code |
| CodeChefP2S | |
| CodeForcesP2S | |
| CodeSearchNet | |
| CommitPackFT | |
| CoSQA | |
| CoSQA+ | |
| Doc2Code | |
| LeetCodeP2S | |
| MBPP | |
| Spider | |
| SWE-Bench | |
| SyntheticText2SQL | |
| WikiSQL | |
| CodeFeedbackMT | TechQA |
| CodeFeedbackST | |
| DataScience | |
| GlaiveCodeAssistantV2 | |
| MLQuestions | |
| StackExchangeBody | |
| StackExchangePost | |
| StackExchangeTitle | |
| StackOverflowQA | |
| CodeChefS2S | Code2Code |
| CodeChefXLang | |
| CodeForcesS2S | |
| CodeTransOceanContest | |
| CodeTransOceanDL | |
| HackerEarth | |
| LeetCodeXLang | |
| SyntheticDLTrans | |
| COIRCodeSearchNetRetrieval | Code2NL |
| CodeSearchNetCCRetrieval | Code2Completion |
| Dataset | Task type |
|---|---|
| AppsRetrieval | NL2Code |
| CodeChef | |
| CodeEditSearchRetrieval | |
| CodeSearchNetRetrieval | |
| CommitPackFT | |
| CoSQA | |
| CoSQA+ | |
| Doc2Code | |
| HumanEval | |
| LeetCode | |
| MBPP | |
| Spider | |
| SWE-Bench | |
| SyntheticText2SQL | |
| WikiSQL | |
| CodeFeedbackMT | TechQA |
| CodeFeedbackST | |
| DS-1000 | |
| MLQuestions | |
| StackOverflowQA | |
| CodeChefXLang | Code2Code |
| CodeTransOceanContest | |
| CodeTransOceanDL | |
| COIRCodeSearchNetRetrieval | Code2NL |
| CodeSearchNetCCRetrieval | Code2Completion |
| Benchmark | Last-token | Mean | Latent attention |
|---|---|---|---|
| CoSQA+ | 15.42% | 15.36% | 15.55% |
| CoSQA* | 39.25% | 37.13% | 38.58% |
| MBPP | 89.01% | 87.01% | 88.57% |
| COIR-CSN* | 85.73% | 85.01% | 85.50% |
| CSN* | 90.68% | 90.48% | 90.65% |
| Doc2Code | 95.98% | 95.91% | 95.94% |
| SWE-Bench | 83.00% | 83.88% | 83.31% |
| CES* | 83.25% | 82.94% | 83.09% |
| CP-FT | 63.00% | 62.32% | 63.10% |
| AppsR* | 84.17% | 83.26% | 84.43% |
| LeetCode | 57.86% | 58.08% | 58.17% |
| CodeChef | 94.03% | 92.08% | 95.03% |
| SynText2SQL* | 72.80% | 72.60% | 72.93% |
| Spider | 81.65% | 81.99% | 81.57% |
| WikiSQL | 98.31% | 93.50% | 97.85% |
| CF-MT* | 89.56% | 86.09% | 88.95% |
| CF-ST* | 85.73% | 84.55% | 85.23% |
| StackOQA* | 91.04% | 90.46% | 90.58% |
| DS-1000 | 59.77% | 58.91% | 60.20% |
| MLQuestions | 81.05% | 79.78% | 81.07% |
| CTOC* | 90.37% | 86.85% | 90.70% |
| CTODL* | 41.69% | 38.17% | 40.58% |
| CodeChefXLang | 99.70% | 99.31% | 99.21% |
| CSN-CC* | 90.41% | 88.65% | 89.72% |
| HumanEval* | 96.77% | 95.78% | 96.35% |
| Overall AVG | 78.41% | 77.20% | 78.27% |
| MTEB Code AVG | 78.72% | 77.18% | 78.41% |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
- 🤗jinaai/jina-code-embeddings-0.5bmodel· 19k dl· ♡ 1819k dl♡ 18
- 🤗jinaai/jina-code-embeddings-1.5bmodel· 200k dl· ♡ 40200k dl♡ 40
- 🤗jinaai/jina-code-embeddings-0.5b-GGUFmodel· 2.0k dl· ♡ 52.0k dl♡ 5
- 🤗jinaai/jina-code-embeddings-1.5b-GGUFmodel· 889 dl· ♡ 15889 dl♡ 15
- 🤗alvarobartt/jina-code-embeddings-1.5bmodel· 7 dl· ♡ 17 dl♡ 1
- 🤗jinaai/jina-code-embeddings-0.5b-mlxmodel· 135 dl135 dl
- 🤗jinaai/jina-code-embeddings-1.5b-mlxmodel· 175 dl175 dl
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Software Engineering Research · Natural Language Processing Techniques
\workshoptitle
DL4CODE
Efficient Code Embeddings from Code Generation Models
Daria Kryvosheieva1,2 Saba Sturua2 Michael Günther2
Scott Martens2 Han Xiao2
1Massachusetts Institute of Technology 2Jina AI GmbH
Prinzessinnenstraße 19, 10969, Berlin, Germany
[email protected] Work done during internship at Jina AI.
Abstract
jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
1 Introduction
The rapid adoption of AI-powered development environments like Cursor and Claude Code has transformed software engineering, with code embedding models serving as a critical foundation for retrieval and context engineering of these systems.
Code embedding models, starting with Gu et al. (2018), have evolved into a well-established subfield with dedicated benchmarks (Husain et al., 2019; Li et al., 2025a) and leaderboards (Enevoldsen et al., 2025). While code generation models like Codex (Zaremba and Brockman, 2021) can directly synthesize code from natural language prompts, practical code generation requires contextual understanding of existing codebases, API usage patterns, and integration requirements. This naturally positions code generation systems as retrieval-augmented generation (RAG) architectures (Lewis et al., 2020), where embedding models serve as the critical retrieval component.
Despite their importance, current code embedding models face a fundamental training data limitation. Supervised training typically relies on aligned data such as inline comments, documentation strings, and pedagogical examples from technical documentation—sources that provide insufficient semantic grounding for complex real-world development scenarios. In contrast, the abundant unaligned code and natural language documentation used to train modern LLMs remains largely underutilized for embedding model development.
We address this gap by introducing two high-quality code embedding models: jina-code-embeddings-0.5b and 1.5b, with 494 million and 1.54 billion parameters, respectively. Our approach implements several key innovations: First, we leverage dedicated pre-trained code generation LLMs as backbones, adapting them specifically for embedding generation. Second, through comprehensive task analysis across functional areas of code embedding applications, we develop targeted training strategies that optimize performance for each use case. The resulting models achieve significant improvements over previous models of comparable size, with benchmark performance competitive with much larger alternatives.
2 Related Work
General-purpose semantic embedding models can function as code embedding models, and if large enough and trained on sufficient relevant data, can even perform comparably with specialized models. For example, among recent general-purpose text embedding models, Qwen3 Embedding (Zhang et al., 2025) and Gemini Embedding (Google DeepMind, 2025) perform well on code tasks. However, general-purpose models that support all kinds of texts are very large and expensive to train and deploy.
Many code embedding models are based on variants of the BERT model (Devlin et al., 2019), like CodeBERT (Feng et al., 2020) and jina-embeddings-v2-base-code (Günther et al., 2024). However, this approach requires specialized datasets and struggles with adequate data sources.
An alternative approach is to take pre-trained general-purpose embedding models and adapt them to code, leveraging their existing language knowledge to compensate for shortcomings in aligned natural language and code data. Sturua et al. (2024) use a LoRA adapter (Hu et al., 2022) to specialize the general-purpose jina-embeddings-v3 model for code retrieval.
Recent developments have demonstrated that autoregressive decoder architectures can be adapted to generate high-quality embeddings (Lee et al., 2025). This class of model generates tokens sequentially as a function of the preceding tokens and can be trained using standard denoising objectives and copious quantities of readily available natural language text. They can then be adapted, after pre-training, to produce high-quality embeddings by pooling the token embedding outputs of the last hidden layer and fine-tuning the model. This is the approach used in the Qwen3 embedding model family (Zhang et al., 2025), and in jina-embeddings-v4 (Günther et al., 2025).
Many different pooling methods have been explored for embedding model generation. Reimers and Gurevych (2019) found mean pooling to perform better than max pooling or CLS pooling111CLS pooling uses the output embedding of a pre-pended [CLS] token as the final embedding vector. for encoder-based transformer models. For decoder-only models, embeddings are most commonly generated via last-token pooling Wang et al. (2024). Recently, Lee et al. (2025) have proposed a novel pooling scheme they call latent attention pooling, inspired by the transformer architecture, which has trainable weights. They report significant improvements in general embedding performance using this method.
Instruction-tuning for specific domains and task types generally yields improved performance (Su et al., 2023). Adding instructions to each text before generating an embedding trains the network to produce embeddings that reflect that instruction. The Qwen3 model (Zhang et al., 2025), which implements instruction-tuning, provides for user-generated instructions, making it difficult to optimize performance and leading to uncertainty about how the model will behave.
3 Model Architecture and Task Prefixes
jina-code-embeddings-0.5b and 1.5b employ an autoregressive decoder backbone architecture, i.e., a model which generates tokens sequentially as a function of the preceding tokens. They build on the Qwen2.5-Coder-0.5B and Qwen2.5-Coder-1.5B backbones (Hui et al., 2024), which are both very compact LLMs.
The final hidden layer of the LLMs is transformed into an embedding via last-token pooling. We found, after some experimentation, that last-token pooling gave us better performance than mean pooling or latent attention pooling, as documented in Appendix B. CLS pooling was not tested, but is generally not favored for decoder-only architectures.
We analyzed downstream code embedding tasks and divided them into five categories: Natural language to code retrieval (NL2Code), technical question answering (TechQA), code-to-code retrieval (Code2Code), code to natural language retrieval (Code2NL), and code to completion retrieval (Code2Completion).
For each of these tasks, we created an instruction string in English that prefixes the text passed to the model, listed in Table 1. Different instruction strings are used for queries and for documents.
4 Training
We initialized the model with weights of the pre-trained backbone Qwen2.5-Coder-0.5B and then applied further training with a contrastive objective using the InfoNCE loss function (van den Oord et al., 2019). Pairs of inputs are classed as related or unrelated, and the model learns to embed related items closely together and unrelated items further apart.
We use Matryoshka representation learning (Kusupati et al., 2022) during training to produce truncatable embeddings, so users can make flexible trade-offs between precision and resource usage.
4.1 Training Data
The training data consists of query-document pairs for a variety of code retrieval tasks, largely using docstrings, comments, commit messages, and problem statements as queries, and matching code snippets, diffs, or answers as documents. We also used a selection of internet forum questions and answers relating to computer technologies. These pairs have been collected from various sources, including the training splits of MTEB code tasks and the non-MTEB code retrieval dataset CoSQA+. Additionally, we adapted public datasets originally created for other purposes. We also used GPT-4o (OpenAI, 2024) to synthetically generate datasets when available data is scarce. Synthetic examples were validated by manual inspection of samples.
Details of the training datasets and synthetic data generation are summarized in Appendix A.
4.2 Procedure
In each training step, we sample a batch of query-document text pairs. We generate normalized embeddings for all texts in the selected pairs. We then construct a matrix of similarity values by calculating the cosine similarity of all combinations of embeddings and in . We train by taking the training embedding pairs as similar, and all other combinations of in each batch as dissimilar.
Then, we apply the contrastive InfoNCE loss function (van den Oord et al., 2019) on the resulting matrix of similarity scores.
[TABLE]
where is the temperature (training hyperparameter), and is the batch size, which increases the weight of small differences in similarity scores in calculating the loss. During training, we maintain constant hyperparameters: , for the 0.5B parameter model and for the 1.5B parameter one, and sequence length is 512.
Training was for 1500 steps on four 80GB VRAM A100 GPUs. Training the 0.5B parameter model took approximately 8.3 hours, and approximately 12 hours for the 1.5B parameter one. As described in Appendix B, we repeated training under three conditions to determine the best pooling method.
5 Evaluation
To assess performance on code retrieval, we evaluate the model on the MTEB-CoIR benchmark (Li et al., 2025b), which consists of 10 tasks spanning text-to-code, code-to-text, code-to-code, and hybrid code retrieval types. We also evaluate the model on code-related MTEB tasks CodeSearchNetRetrieval, CodeEditSearchRetrieval, HumanEval, MBPP, DS-1000, WikiSQL, and MLQuestions, as well as CosQA+ and our in-house benchmarks. See Appendix C for evaluation hyperparameters. Results are reported in Table 2.
Both jina-code-embeddings-0.5b and 1.5b outperform similar-sized general-purpose embedding model Qwen3-Embedding-0.6B and the substantially larger models jina-embeddings-v4 and gemini-embedding-001.
6 Conclusion
We have introduced jina-code-embeddings, a family of code embedding models with 0.5B and 1.5B parameters. By using an autoregressive backbone pre-trained on both text and code, along with task-specific instruction prefixes and last-token pooling, the models excel at a wide variety of tasks and domains related to code retrieval. Despite their smaller size compared to other models, the jina-code-embeddings suite achieves state-of-the-art performance, demonstrating the validity and effectiveness of its unique construction methodology.
Appendix A Training Datasets
Training data for jina-code-embeddings draws on a variety of sources, described in Section 4.1.
- •
Training data splits for MTEB code tasks, and the CoSQA+ dataset.
- •
Other public datasets adapted to our training needs.
- •
Fully or partially synthetic datasets generated using GPT-4o.
We used GPT-4o [OpenAI, 2024] for synthetic data generation. The SyntheticDLTrans dataset consists of generated deep learning code translations between frameworks, an area where very little non-synthetic data is available. We also synthesized a multilingual extension of the CodeChef dataset [Caballero and Sutskever, 2016], using the original programming solutions in C++ and Python to generate solutions in eight more programming languages. The resulting dataset has been adapted for three tasks: CodeChefP2S (problem-to-solution), CodeChefS2S (monolingual solution-to-solution), and CodeChefXLang (crosslingual solution-to-solution).
Table 3 provides details about the provenance of all training datasets. Table 4(a) shows the task categories of each training dataset, and Table 4(b) does the same for the evaluation datasets.
Appendix B Ablation
We trained three versions of the 0.5B model with the same training data, hyperparameters, and number of steps (1500), but with three different pooling methods: last-token, mean, and latent-attention. We found that last-token pooling results in the highest average performance (see Table 5).
Appendix C Evaluation Hyperparameters
jina-code-embeddings and Qwen3-Embedding-0.6B were evaluated in FP16 with a batch size of 8 and a sequence length of 8192; jina-embeddings-v4 was evaluated in BF16 with a batch size of 8 and a sequence length of 8192. The Voyage and Gemini models were evaluated via the respective APIs with a batch size of 8, except the tasks COIRCodeSearchNetRetrieval and CodeSearchNetCCRetrieval, which we did not evaluate due to the large size of the benchmarks and the resulting cost in time and money, so we took public scores from the MTEB GitHub.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Caballero and Sutskever [2016] Ethan Caballero and Ilya Sutskever. Description 2Code Dataset, 2016. URL https://github.com/ethancaballero/description 2code .
- 2Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers) , 2019. doi: 10.18653/V 1/N 19-1423 . URL https://aclanthology.org/N 19-1423/
- 3Enevoldsen et al. [2025] Kenneth Enevoldsen, Isaac Chung, et al. MMTEB: Massive Multilingual Text Embedding Benchmark. ar Xiv preprint ar Xiv:2502.13595 , 2025. URL https://arxiv.org/abs/2502.13595 .
- 4Feng et al. [2020] Zhangyin Feng, Daya Guo, Duyu Tang, et al. Code BERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020 , pages 1536–1547, 2020. doi: 10.18653/v 1/2020.findings-emnlp.139 . URL https://aclanthology.org/2020.findings-emnlp.139/ .
- 5Google Deep Mind [2025] Google Deep Mind. Gemini Embedding: Generalizable Embeddings from Gemini. ar Xiv preprint ar Xiv:2503.07891 , 2025. URL https://arxiv.org/abs/2503.07891 .
- 6Gu et al. [2018] Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. Deep Code Search. In Proceedings of the 40th International Conference on Software Engineering , ICSE ’18, page 933–944, 2018. doi: 10.1145/3180155.3180167 . URL https://dl.acm.org/doi/10.1145/3180155.3180167 .
- 7Günther et al. [2024] Michael Günther, Jackmin Ong, Isabelle Mohr, et al. Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents. ar Xiv preprint ar Xiv:2310.19923 , 2024. URL https://arxiv.org/abs/2310.19923 .
- 8Günther et al. [2025] Michael Günther, Saba Sturua, Mohammad Kalim Akram, et al. jina-embeddings-v 4: Universal Embeddings for Multimodal Multilingual Retrieval. ar Xiv preprint ar Xiv:2506.18902 , 2025. URL https://arxiv.org/abs/2506.18902 .
