Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

Siyue Zhang; Yilun Zhao; Liyuan Geng; Arman Cohan; Anh Tuan Luu; Chen Zhao

arXiv:2505.15045 · cs.CL · May 22, 2025

TL;DR

This paper compares diffusion and autoregressive language models as text embedding models, arguing that the bidirectional attention of diffusion models captures global context better and improves long-document and reasoning-intensive retrieval.

Contribution

The paper presents the first systematic study of diffusion language embedding models and reports that they outperform LLM-based embedding models on long-document, reasoning-intensive, and instruction-following retrieval.

Findings

01. Diffusion language embedding models outperform LLM-based embedding models by 20% on long-document retrieval.

02. Diffusion language embedding models improve reasoning-intensive retrieval by 8%.

03. Bidirectional attention is key for encoding global context.
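
To make the third finding concrete, here is a minimal toy sketch of why the attention mask matters for embeddings. It is not the paper's model: the single-head attention with identity projections, the tiny hand-written token vectors, and the function names `attention` and `first_token_changes` are all assumptions made for illustration. It only shows that under a causal (unidirectional) mask an early token's representation cannot react to later tokens, while under bidirectional attention it can.

```python
import math


def attention(x, causal):
    """Toy single-head self-attention with identity Q/K/V projections.

    x is a list of token vectors. With causal=True, token i attends only
    to positions 0..i; with causal=False, every token attends to all tokens.
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        visible = range(i + 1) if causal else range(n)
        # Scaled dot-product scores against the visible positions only.
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in visible
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of the visible token vectors.
        out.append([
            sum(w / total * x[j][k] for w, j in zip(weights, visible))
            for k in range(d)
        ])
    return out


def first_token_changes(causal):
    """Check whether editing the LAST token alters the FIRST token's state."""
    a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]  # only the last token differs
    ha, hb = attention(a, causal), attention(b, causal)
    return any(abs(p - q) > 1e-9 for p, q in zip(ha[0], hb[0]))


if __name__ == "__main__":
    print("causal mask, first token reacts to last token:",
          first_token_changes(causal=True))
    print("bidirectional, first token reacts to last token:",
          first_token_changes(causal=False))
```

In this toy setting the causal variant leaves the first token's state untouched when the last token changes, so any pooled embedding built from early positions sees only a prefix of the text; the bidirectional variant lets every position reflect the whole input.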

Abstract

Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT- and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, 2% on instruction-following retrieval, and achieve…

Code & Models

Datasets

siyue/ReasonAug
dataset · 19 dl

Videos

Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective · underline

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Softmax · Attention Dropout · WordPiece · Linear Layer · Residual Connection · Weight Decay · Dropout