TL;DR
This paper introduces a unified instruction-based framework for learning versatile text embeddings using a decoder-only LLM, achieving strong performance across diverse IR and non-IR tasks without task-specific fine-tuning.
Contribution
It proposes a novel approach combining in-context learning, soft supervision, and adaptive hard-negative mining to generate high-quality, generalized text embeddings.
Findings
Achieves top performance on the MTEB benchmark across 41 tasks.
Outperforms larger or fully fine-tuned models in generalization.
Demonstrates effective use of soft relevance scores and adaptive negative sampling.
Abstract
This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Built upon a decoder-only large language model (Mistral-7B), our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings without task-specific fine-tuning. Structured instructions and few-shot examples are used to guide the model across diverse tasks, enabling strong performance on classification, semantic similarity, clustering, and reranking benchmarks. To improve semantic discrimination, we employ a soft labeling framework where continuous relevance scores, distilled from a high-performance dense retriever and reranker, serve as fine-grained supervision signals. In addition, we introduce adaptive margin-based hard-negative mining, which filters out…
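The soft supervision idea in the abstract can be illustrated with a small sketch. This is not the paper's loss, which the abstract does not specify. It assumes one common form of score distillation: the teacher's continuous relevance scores and the student's embedding similarities are each turned into a distribution over candidates, and the student is penalized by the KL divergence between them. The function names, the temperatures, and the use of cosine similarity are all illustrative assumptions.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_label_loss(query_emb, cand_embs, teacher_scores,
                    student_temp=0.05, teacher_temp=1.0):
    """KL(teacher || student) over the candidates of one query.

    Assumption: teacher_scores are continuous relevance scores from a
    stronger retriever or reranker, one per candidate. The temperatures
    are hypothetical defaults, not values from the paper.
    """
    sims = [cosine(query_emb, c) for c in cand_embs]
    student = softmax(sims, student_temp)
    teacher = softmax(teacher_scores, teacher_temp)
    eps = 1e-12  # guards against log(0)
    return sum(t * (math.log(t + eps) - math.log(s + eps))
               for t, s in zip(teacher, student))


# Toy usage: the first candidate points the same way as the query.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0]]
aligned = soft_label_loss(query, candidates, teacher_scores=[5.0, 0.0])
misaligned = soft_label_loss(query, candidates, teacher_scores=[0.0, 5.0])
```

In this toy setup the loss is small when the teacher prefers the candidate the student already ranks first, and large when the teacher prefers the other one. Unlike a binary positive/negative label, the teacher distribution also carries how much more relevant one candidate is than another.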
