Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment
Jingcheng Deng, Zhongtao Jiang, Liang Pang, Liwei Chen, Kun Xu, Zihao Wei, Huawei Shen, Xueqi Cheng

TL;DR
This paper introduces AutoRegEmbed, a contrastive learning method that aligns LLM embeddings with their conditional probability distributions, improving semantic encoding and outperforming traditional contrastive approaches.
Contribution
The paper presents a novel contrastive learning framework that integrates information compression and distribution alignment to better utilize LLM embeddings.
Findings
AutoRegEmbed outperforms traditional contrastive methods.
Achieves performance comparable to state-of-the-art models.
Effectively captures global semantics in embeddings.
Abstract
A new trend uses LLMs as dense text encoders via contrastive learning. However, since LLM embeddings predict the probability distribution of the next token, they are inherently generative and distributive, conflicting with contrastive learning, which requires embeddings to capture full-text semantics and align via cosine similarity. This discrepancy hinders the full utilization of LLMs' pre-training capabilities, resulting in inefficient learning. In response to this issue, we propose AutoRegEmbed, a new contrastive learning method built on embedding conditional probability distributions, which integrates two core tasks: information compression and conditional distribution alignment. The information compression task encodes text into the embedding space, ensuring that the embedding vectors capture global semantics. The conditional distribution alignment task focuses on aligning text…
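For context, the conventional objective that the abstract contrasts against can be sketched as below. This is a minimal, illustrative InfoNCE loss over cosine similarities for a single query; it is not AutoRegEmbed's objective, which the truncated abstract does not fully specify. The function names `cosine` and `info_nce`, the toy vectors, and the temperature of 0.05 are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: negative log-softmax of the positive's
    scaled cosine similarity among all candidates (hypothetical helper)."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]

# Toy 2-d embeddings (assumed values, not taken from the paper).
query = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_good = info_nce(query, positive, negatives)
# Treating an orthogonal vector as the "positive" yields a much larger loss.
loss_bad = info_nce(query, negatives[0], [positive, negatives[1]])
print(loss_good < loss_bad)
```

The loss depends only on the angles between embedding vectors, which is the cosine-similarity alignment the abstract describes as being at odds with embeddings that were trained to parameterize a next-token distribution.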
Taxonomy
Topics: semigroups and automata theory · Natural Language Processing Techniques · Algorithms and Data Compression
Methods: Contrastive Learning · ALIGN
