# Efficient Code Embeddings from Code Generation Models

**Authors:** Daria Kryvosheieva, Saba Sturua, Michael Günther, Scott Martens, Han Xiao

arXiv: 2508.21290 · 2025-09-01

## TL;DR

This paper introduces jina-code-embeddings, a new suite of code embedding models that leverage autoregressive pre-training on text and code to enable effective code retrieval, question-answering, and semantic similarity detection across languages.

## Contribution

The paper presents a training approach that derives embeddings from autoregressive models via last-token pooling, and it reports state-of-the-art performance with relatively small models.
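To make last-token pooling concrete, here is a minimal sketch in plain Python. It assumes the model exposes one hidden-state vector per token plus an attention mask marking real tokens with 1 and padding with 0. The function names `last_token_pool` and `l2_normalize` and the toy values are hypothetical illustrations, not the authors' implementation, and the L2 normalization step is a common convention rather than something this summary confirms.

```python
import math
from typing import List


def last_token_pool(hidden_states: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Return the hidden state of the last non-padding token.

    hidden_states: one vector per token position (hypothetical model output).
    attention_mask: 1 for real tokens, 0 for padding.
    """
    real_positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real_positions:
        raise ValueError("attention_mask contains no real tokens")
    # Taking the highest unmasked index works for left- and right-padded inputs.
    return list(hidden_states[real_positions[-1]])


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length (assumed post-processing step)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy sequence: three real tokens followed by one padding position.
toy_hidden_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_attention_mask = [1, 1, 1, 0]

pooled = last_token_pool(toy_hidden_states, toy_attention_mask)
embedding = l2_normalize(pooled)
print(pooled, embedding)
```

In a causal (autoregressive) model only the final token has attended to the entire input, which is the usual motivation for pooling from that position rather than averaging all tokens.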

## Key findings

- State-of-the-art code retrieval accuracy
- Effective cross-language code similarity detection
- Successful application to technical question-answering

## Abstract

jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
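The retrieval use case in the abstract can be illustrated with a small ranking sketch. It assumes queries and code snippets have already been embedded into the same vector space and that candidates are scored by cosine similarity; that scoring function is a common choice and an assumption here, since this summary does not state which one the paper uses. The helper names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_snippets(
    query_embedding: Sequence[float],
    snippet_embeddings: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (snippet index, score) pairs sorted from most to least similar."""
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(snippet_embeddings)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings standing in for real model output.
toy_query = [1.0, 0.0]
toy_snippets = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

ranking = rank_snippets(toy_query, toy_snippets)
print([index for index, _ in ranking])
```

The same scoring applies to the cross-language similarity task: two snippets in different programming languages are compared by embedding both and scoring the pair directly.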

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21290/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21290/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/2508.21290/full.md

---
Source: https://tomesphere.com/paper/2508.21290