Diffusion-Pretrained Dense and Contextual Embeddings
Sedigheh Eslami, Maksim Gaiduk, Markus Krimmel, Louis Milliken, Bo Wang, Denis Bykov

TL;DR
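This paper introduces pplx-embed, a family of multilingual embedding models that combine diffusion-based pretraining with multi-stage contrastive learning for web-scale retrieval. pplx-embed-v1 is competitive on several public retrieval benchmarks, pplx-embed-context-v1 sets new records on ConTEB, and the models also perform well on real-world web-scale data.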
Contribution
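The paper presents multilingual embedding models built on a diffusion-pretrained backbone whose bidirectional attention captures context across a whole passage. This enables mean pooling and a late chunking strategy, and the paper releases two variants: pplx-embed-v1 for standard retrieval and pplx-embed-context-v1 for contextualized embeddings.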
Findings
pplx-embed-v1 achieves competitive results on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks.
pplx-embed-context-v1 sets new records on ConTEB.
Models perform well in large-scale, real-world search scenarios.
Abstract
In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks,…
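The mean pooling and late chunking mentioned in the abstract can be illustrated with a minimal sketch. It assumes the generic form of late chunking: the encoder runs once over the full document, and each chunk vector is the mean of the token vectors inside that chunk's span, so every chunk already reflects document-wide context. The function names `mean_pool` and `late_chunk` are hypothetical, the random array stands in for real encoder output, and the paper's exact procedure may differ.

```python
import numpy as np


def mean_pool(token_embeddings, mask=None):
    """Average token vectors; mask (1 = real token, 0 = padding) is optional."""
    token_embeddings = np.asarray(token_embeddings, dtype=np.float64)
    if mask is None:
        mask = np.ones(token_embeddings.shape[0])
    mask = np.asarray(mask, dtype=np.float64)[:, None]
    # Guard against division by zero when every token is masked out.
    return (token_embeddings * mask).sum(axis=0) / np.maximum(mask.sum(), 1.0)


def late_chunk(token_embeddings, chunk_spans):
    """Pool each (start, end) token span of embeddings computed over the full document."""
    return np.stack([mean_pool(token_embeddings[s:e]) for s, e in chunk_spans])


# Stand-in for encoder output: 10 tokens with 4 dimensions each (assumption, not real model output).
rng = np.random.default_rng(0)
doc_tokens = rng.normal(size=(10, 4))

# Hypothetical chunk boundaries, expressed as half-open token index ranges.
spans = [(0, 4), (4, 10)]
chunk_vectors = late_chunk(doc_tokens, spans)
print(chunk_vectors.shape)  # (2, 4)
```

The key design point is the order of operations: chunking happens after encoding rather than before, which only helps when the encoder attends in both directions across the whole document.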
Code & Models
- 🤗 perplexity-ai/pplx-embed-v1-0.6b (model, 1.0M downloads, 201 likes)
- 🤗 perplexity-ai/pplx-embed-context-v1-0.6b (model, 170k downloads, 51 likes)
- 🤗 perplexity-ai/pplx-embed-v1-4b (model, 15k downloads, 52 likes)
- 🤗 agentmish/pplx-embed-v1-0.6b-mlx (model, 101 downloads, 2 likes)
- 🤗 perplexity-ai/pplx-embed-context-v1-4b (model, 18k downloads, 31 likes)
- 🤗 agentmish/pplx-embed-v1-4b-mlx (model, 30 downloads)
- 🤗 LHC88/pplx-embed-v1-4B (model, 31 downloads)
- 🤗 mmrech/pplx-embed-v1-0.6b (model, 14 downloads)
- 🤗 tss-deposium/pplx-embed-v1-0.6b-onnx-int8-standard (model, 74 downloads)
- 🤗 beaupi/pplx-embed-context-v1-4b-oQ8 (model)
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Domain Adaptation and Few-Shot Learning
