Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
Luke Merrick, Danmei Xu, Gaurav Nuti, Daniel Campos

TL;DR
Arctic-Embed introduces a family of scalable, open-source text embedding models that achieve state-of-the-art retrieval accuracy across various sizes, outperforming some proprietary models.
Contribution
The paper presents a new set of open-source, efficient text embedding models with detailed training recipes and ablation studies, achieving top retrieval performance.
Findings
Models range from 22 to 334 million parameters.
Largest model outperforms some proprietary embedding models.
Training recipe and ablation studies explain performance gains.
Abstract
This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters, with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of its size on the MTEB Retrieval leaderboard, with the largest model, arctic-embed-l, outperforming closed-source embedding models such as Cohere's embed-v3 and OpenAI's text-embedding-3-large. In addition to the details of our training recipe, we provide several informative ablation studies, which we believe shed light on the causes of our models' performance.
Code & Models
- 🤗 Snowflake/snowflake-arctic-embed-m · model · 385k dl · ♡ 164
- 🤗 Snowflake/snowflake-arctic-embed-m-long · model · 33k dl · ♡ 38
- 🤗 Snowflake/snowflake-arctic-embed-s · model · 50k dl · ♡ 24
- 🤗 Snowflake/snowflake-arctic-embed-xs · model · 211k dl · ♡ 39
- 🤗 Snowflake/snowflake-arctic-embed-l · model · 61k dl · ♡ 100
- 🤗 Snowflake/snowflake-arctic-embed-m-v1.5 · model · 130k dl · ♡ 70
- 🤗 nvidia/nemocurator-fineweb-mixtral-edu-classifier · model · 977 dl · ♡ 7
- 🤗 nvidia/nemocurator-fineweb-nemotron-4-edu-classifier · model · 4.4k dl · ♡ 12
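As a rough illustration of how embeddings from checkpoints like those listed above are typically used for retrieval, the sketch below ranks documents against a query by cosine similarity. It makes several assumptions: the vectors are hand-written toy values standing in for real model outputs, cosine similarity is used only because it is a common scoring choice (this page does not state which scoring function the models expect), and the helper names `cosine_similarity` and `rank_documents` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors (assumption: real embeddings would come from
# an embedding model and have hundreds of dimensions).
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 0.8],  # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.5],  # partially aligned with the query
]

ranking = rank_documents(query, docs)
for index, score in ranking:
    print(f"doc {index}: {score:.4f}")
```

In a real pipeline the toy vectors would be replaced by the outputs of an embedding model, and the ranking step would usually be delegated to a vector index rather than a Python loop.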
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
