Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment

Jingcheng Deng; Zhongtao Jiang; Liang Pang; Liwei Chen; Kun Xu; Zihao Wei; Huawei Shen; Xueqi Cheng

arXiv:2502.11401 · cs.CL · October 17, 2025

TL;DR

This paper introduces AutoRegEmbed, a contrastive learning method that aligns LLM embeddings with their conditional probability distributions, improving semantic encoding and outperforming traditional contrastive approaches.

Contribution

The paper presents a novel contrastive learning framework that integrates information compression and distribution alignment to better utilize LLM embeddings.

Findings

01 AutoRegEmbed outperforms traditional contrastive learning methods.

02 It achieves performance comparable to state-of-the-art models.

03 Its embeddings effectively capture global semantics.

Abstract

A new trend uses LLMs as dense text encoders via contrastive learning. However, since LLM embeddings predict the probability distribution of the next token, they are inherently generative and distributive, conflicting with contrastive learning, which requires embeddings to capture full-text semantics and align via cosine similarity. This discrepancy hinders the full utilization of LLMs' pre-training capabilities, resulting in inefficient learning. In response to this issue, we propose AutoRegEmbed, a new contrastive learning method built on embedding conditional probability distributions, which integrates two core tasks: information compression and conditional distribution alignment. The information compression task encodes text into the embedding space, ensuring that the embedding vectors capture global semantics. The conditional distribution alignment task focuses on aligning text…
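
For context, the conventional contrastive objective that the abstract contrasts with can be sketched as an InfoNCE loss over cosine similarities. This is a minimal illustration of that baseline only, not AutoRegEmbed itself and not code from the paper's repository; the function names (`cosine`, `info_nce`), the toy vectors, and the temperature value are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, docs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    docs[i] is treated as the positive for queries[i]; every other
    document in the batch serves as a negative. The temperature is an
    illustrative value, not one taken from the paper.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive document.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two queries with matching and mismatched documents.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(queries, aligned)
loss_swapped = info_nce(queries, swapped)
```

In this toy batch the loss is lower when each query is paired with its matching document than when the pairs are swapped, which is the behavior a cosine-based contrastive objective is meant to reward.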

Code & Models

Repositories

trustedllm/autoregembed
PyTorch · Official

Videos

Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment · underline

Taxonomy

Topics: Semigroups and automata theory · Natural Language Processing Techniques · Algorithms and Data Compression

Methods: Contrastive Learning · ALIGN