TL;DR
This paper presents a novel training approach combining distillation and contrastive loss to create compact, high-performance text embedding models that support long texts and multiple languages.
Contribution
It introduces a new training regimen that outperforms existing methods for small models and provides publicly available weights to foster further research.
Findings
Benchmark scores match or exceed the state of the art for models of similar size.
Models support long texts up to 32k tokens in many languages.
Embeddings remain robust under truncation and quantization.
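What truncation and quantization mean for an embedding vector can be illustrated with a minimal sketch. The summary does not say which schemes the authors evaluate, so the choices here are assumptions for illustration: prefix truncation followed by re-normalization, and sign-based binary quantization. All function names are hypothetical and are not part of any released API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize (assumed scheme)."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(vec[:dim])


def binarize(vec):
    """Sign-based binary quantization (assumed scheme): 1 if positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]


def cosine(a, b):
    """Cosine similarity between two float vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def hamming_similarity(a_bits, b_bits):
    """Fraction of positions where two equal-length bit vectors agree."""
    matches = sum(1 for x, y in zip(a_bits, b_bits) if x == y)
    return matches / len(a_bits)


# Toy vectors standing in for real model outputs.
full_a = [0.9, 0.1, -0.3, 0.2]
full_b = [0.8, 0.2, -0.1, -0.4]

full_score = cosine(full_a, full_b)
truncated_score = cosine(truncate_embedding(full_a, 2), truncate_embedding(full_b, 2))
binary_score = hamming_similarity(binarize(full_a), binarize(full_b))
```

An embedding is robust in this sense when rankings computed from `truncated_score` or `binary_score` stay close to those computed from `full_score`, at a fraction of the storage cost.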
Abstract
Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust…
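One way to picture a regimen that combines distillation with a contrastive loss is as a weighted sum of two terms. The sketch below is not the paper's formulation, which this summary does not give. It assumes an InfoNCE loss with in-batch negatives as the contrastive term, a mean cosine distance between student and teacher embeddings as the distillation term, and a hypothetical mixing weight `alpha`. It also assumes student and teacher embeddings share a dimension.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, docs, temperature=0.05):
    """InfoNCE with in-batch negatives: queries[i] is paired with docs[i]."""
    queries = [normalize(q) for q in queries]
    docs = [normalize(d) for d in docs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def distillation_loss(student, teacher):
    """Mean cosine distance between student and teacher embeddings (assumed form)."""
    distances = [
        1.0 - dot(normalize(s), normalize(t)) for s, t in zip(student, teacher)
    ]
    return sum(distances) / len(distances)


def combined_loss(student_q, student_d, teacher_q, teacher_d, alpha=0.5):
    """Weighted sum of both terms; `alpha` is a hypothetical mixing weight."""
    contrastive = info_nce(student_q, student_d)
    distill = distillation_loss(student_q + student_d, teacher_q + teacher_d)
    return alpha * contrastive + (1.0 - alpha) * distill


# Toy batch of two query/document pairs with 2-dimensional embeddings.
student_q = [[1.0, 0.0], [0.0, 1.0]]
student_d = [[0.9, 0.1], [0.2, 0.8]]
teacher_q = [[1.0, 0.1], [0.1, 1.0]]
teacher_d = [[1.0, 0.0], [0.0, 1.0]]

loss = combined_loss(student_q, student_d, teacher_q, teacher_d)
```

In practice the same idea would be written with a tensor library so that gradients flow to the student; plain Python is used here only to keep the arithmetic visible.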
Code & Models
- 🤗 jinaai/jina-embeddings-v5-text-nano · model · 269k dl · ♡ 71
- 🤗 jinaai/jina-embeddings-v5-text-small · model · 240k dl · ♡ 167
- 🤗 jinaai/jina-embeddings-v5-text-small-retrieval · model · 266k dl · ♡ 23
- 🤗 jinaai/jina-embeddings-v5-text-small-clustering · model · 3.3k dl · ♡ 6
- 🤗 jinaai/jina-embeddings-v5-text-small-text-matching · model · 7.6k dl · ♡ 10
- 🤗 jinaai/jina-embeddings-v5-text-small-classification · model · 3.5k dl · ♡ 3
- 🤗 jinaai/jina-embeddings-v5-text-nano-retrieval · model · 68k dl · ♡ 12
- 🤗 jinaai/jina-embeddings-v5-text-nano-clustering · model · 10k dl · ♡ 5
- 🤗 jinaai/jina-embeddings-v5-text-nano-text-matching · model · 5.3k dl · ♡ 6
- 🤗 jinaai/jina-embeddings-v5-text-nano-classification · model · 5.6k dl · ♡ 7
