Nomic Embed: Training a Reproducible Long Context Text Embedder
Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar

TL;DR
This paper introduces nomic-embed-text-v1, a fully reproducible open-source long-context text embedding model that surpasses existing models on multiple benchmarks, with complete training data and code released for full transparency.
Contribution
It presents the first fully reproducible open-source long-context text embedding model with open data, code, and weights, outperforming proprietary models on key benchmarks.
Findings
Outperforms OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long-context LoCo benchmark
Supports 8192 token context length
Fully reproducible with open data and code
Abstract
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.
Code & Models
- 🤗 nomic-ai/nomic-embed-text-v1.5 (model) · 10.4M dl · ♡ 787
- 🤗 nomic-ai/nomic-embed-text-v1-ablated (model) · 303 dl · ♡ 4
- 🤗 nomic-ai/nomic-embed-text-v1-unsupervised (model) · 748 dl · ♡ 15
- 🤗 nomic-ai/nomic-embed-text-v1 (model) · 3.0M dl · ♡ 559
- 🤗 Severian/nomic (model) · 56 dl
- 🤗 Alibaba-NLP/new-impl (model) · ♡ 18
- 🤗 lightbird-ai/nomic (model) · 33 dl
- 🤗 corto-ai/nomic-embed-text-v1 (model) · 10k dl · ♡ 3
- 🤗 CAiRE/UniVaR-lambda-80 (model) · 79 dl
- 🤗 CAiRE/UniVaR-lambda-20 (model) · 82 dl
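As a rough illustration of how one of the checkpoints listed above might be used for retrieval, here is a minimal sketch. It assumes the `sentence-transformers` library and the task prefixes (`search_query: `, `search_document: `) described on the model's Hugging Face card; neither is stated on this page. The helper names (`with_prefix`, `rank_documents`, `load_encoder`) are hypothetical and chosen only for this example.

```python
import math

# Task prefixes as described on the model card (an assumption relative to this page).
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "


def with_prefix(texts, prefix):
    """Prepend a task prefix to every text in the list."""
    return [prefix + text for text in texts]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query, documents, encode):
    """Rank documents by similarity to the query.

    `encode` is any callable mapping a list of strings to a list of vectors,
    for example the one returned by `load_encoder` below.
    """
    query_vec = encode(with_prefix([query], QUERY_PREFIX))[0]
    doc_vecs = encode(with_prefix(documents, DOCUMENT_PREFIX))
    scored = [(doc, cosine_similarity(query_vec, vec))
              for doc, vec in zip(documents, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def load_encoder(model_name="nomic-ai/nomic-embed-text-v1"):
    """Load the model and return its encode method.

    Imported lazily so the helpers above work without the library installed.
    Downloads weights on first use; trust_remote_code is needed because the
    checkpoint ships custom model code.
    """
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name, trust_remote_code=True)
    return model.encode
```

A call might look like `rank_documents("what is t-SNE?", docs, load_encoder())`, which returns `(document, score)` pairs sorted from most to least similar. Passing the encoder in as a callable keeps the ranking logic independent of any particular checkpoint in the list.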
Taxonomy
Topics: Translation Studies and Practices · Interpreting and Communication in Healthcare · Natural Language Processing Techniques
