LLM2Vec-Gen: Generative Embeddings from Large Language Models

Parishad BehnamGhader; Vaibhav Adlakha; Fabian David Schmidt; Nicolas Chapados; Marius Mosbach; Siva Reddy

arXiv:2603.10913 · cs.CL · April 3, 2026

1 Repo 11 Models

TL;DR

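LLM2Vec-Gen is a self-supervised method that produces embeddings directly in the LLM's output space by learning to represent the model's potential response. Because the embeddings preserve response-space semantics, the method is reported to reduce harmful content retrieval and to improve reasoning-intensive retrieval.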

Contribution

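It produces embeddings in the LLM's response space while keeping the LLM backbone frozen. Trainable special tokens appended to the input are optimized to compress the model's own response into a fixed-length embedding, guided by an unsupervised embedding teacher and a reconstruction objective. Training requires only unlabeled queries.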

Findings

01 Improves on the unsupervised embedding teacher by 8.8% on the Massive Text Embedding Benchmark (MTEB).

02 Reduces harmful content retrieval by up to 22.6%.

03 Improves reasoning-intensive retrieval by up to 35.6%.

Abstract

Fine-tuning LLM-based text embedders via contrastive learning maps inputs and outputs into a new representational space, discarding the LLM's output semantics. We propose LLM2Vec-Gen, a self-supervised alternative that instead produces embeddings directly in the LLM's output space by learning to represent the model's potential response. Specifically, trainable special tokens are appended to the input and optimized to compress the LLM's own response into a fixed-length embedding, guided by an unsupervised embedding teacher and a reconstruction objective. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 8.8% over the unsupervised embedding teacher. Since the embeddings preserve the LLM's response-space semantics, they…
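
The training signal described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation; everything in it is an assumption made for illustration: the "backbone" is a fixed random linear map with causal mean pooling rather than an LLM, the teacher's embedding of the response is a random stand-in vector, gradients are approximated by finite differences, and the reconstruction objective is omitted. All names (`W_FROZEN`, `backbone`, `embed`, `alignment_loss`, `train`) are hypothetical. The only point it shows is that the special-token vectors are the sole trainable parameters while the backbone stays frozen.

```python
import random

random.seed(0)
DIM, NUM_SPECIAL = 4, 2  # toy sizes, chosen for illustration only

# Frozen "backbone": a fixed random linear map standing in for the LLM.
# Nothing below ever writes to it.
W_FROZEN = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]


def backbone(vectors):
    """Toy causal encoder: hidden state i = W_FROZEN @ mean(vectors[:i + 1])."""
    hidden = []
    running = [0.0] * DIM
    for i, vec in enumerate(vectors):
        running = [r + v for r, v in zip(running, vec)]
        mean = [r / (i + 1) for r in running]
        hidden.append([sum(w * m for w, m in zip(row, mean)) for row in W_FROZEN])
    return hidden


def embed(query_vectors, special_tokens):
    """Append the special tokens to the query and mean-pool their hidden states."""
    hidden = backbone(query_vectors + special_tokens)
    pooled = hidden[-len(special_tokens):]
    return [sum(col) / len(pooled) for col in zip(*pooled)]


def alignment_loss(query_vectors, special_tokens, teacher_target):
    """Squared distance between the embedding and the teacher's target vector."""
    emb = embed(query_vectors, special_tokens)
    return sum((e - t) ** 2 for e, t in zip(emb, teacher_target))


def train(query_vectors, teacher_target, steps=200, lr=0.5, eps=1e-5):
    """Gradient descent on the special tokens only (finite-difference gradients)."""
    special = [[0.0] * DIM for _ in range(NUM_SPECIAL)]
    for _ in range(steps):
        base = alignment_loss(query_vectors, special, teacher_target)
        grads = [[0.0] * DIM for _ in range(NUM_SPECIAL)]
        for i in range(NUM_SPECIAL):
            for j in range(DIM):
                special[i][j] += eps
                bumped = alignment_loss(query_vectors, special, teacher_target)
                grads[i][j] = (bumped - base) / eps
                special[i][j] -= eps
        for i in range(NUM_SPECIAL):
            for j in range(DIM):
                special[i][j] -= lr * grads[i][j]
    return special


# Stand-ins: a three-token "query" and a random "teacher embedding of the response".
query = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
teacher_target = [random.uniform(-1, 1) for _ in range(DIM)]

initial = alignment_loss(query, [[0.0] * DIM for _ in range(NUM_SPECIAL)], teacher_target)
learned = train(query, teacher_target)
final = alignment_loss(query, learned, teacher_target)
print(f"alignment loss: {initial:.4f} -> {final:.4f}")
```

In this sketch the embedding is read from the hidden states at the appended special-token positions, so its length is fixed by `DIM` regardless of query length. How the actual method pools those positions and combines the teacher and reconstruction signals is not specified in the text above.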

Code & Models

Repositories

mcgill-nlp/llm2vec-gen (GitHub)
