Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
Michael Günther, Louis Milliken, Jonathan Geuter, Georgios Mastrapas, Bo Wang, Han Xiao

TL;DR
Jina Embeddings introduces high-performance sentence models optimized for semantic understanding, developed through meticulous data preparation and evaluated on extensive benchmarks, including a novel negation-aware dataset.
Contribution
The paper presents a new set of sentence embedding models with improved semantic capture, emphasizing data quality and introducing a negation-aware dataset for better linguistic understanding.
Findings
High performance on the MTEB benchmark
Effective handling of negated statements
Public availability of negation dataset
Abstract
Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating textual inputs into numerical representations, capturing the semantics of the text. These models excel in applications like dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, offers in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). Furthermore, to increase the model's awareness of grammatical negation, we construct a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.
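To make the idea of a negation-aware triplet concrete, here is a minimal sketch of one way such a check could look: an anchor sentence should land closer to a paraphrase than to its negation in embedding space. The helper names (`cosine_similarity`, `triplet_is_ranked_correctly`, `triplet_accuracy`) and the toy vectors are illustrative assumptions, not the paper's evaluation code; in practice the vectors would come from an embedding model.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]
Triplet = Tuple[Vector, Vector, Vector]  # (anchor, positive, negative)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def triplet_is_ranked_correctly(anchor: Vector, positive: Vector, negative: Vector) -> bool:
    """True if the anchor is more similar to the positive than to the negative."""
    return cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)


def triplet_accuracy(triplets: Sequence[Triplet]) -> float:
    """Fraction of triplets in which the positive outranks the negative."""
    if not triplets:
        raise ValueError("need at least one triplet")
    correct = sum(triplet_is_ranked_correctly(a, p, n) for a, p, n in triplets)
    return correct / len(triplets)


# Hypothetical toy embeddings (assumed values, not produced by any real model).
# Triplet 1: the paraphrase sits near the anchor, the negation sits far away.
good = ([0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.1, 0.9, 0.3])
# Triplet 2: the negation is (wrongly) almost identical to the anchor.
bad = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.1, 0.0])

accuracy = triplet_accuracy([good, bad])
```

With these toy values the first triplet is ranked correctly and the second is not, which is the failure mode a negation-aware dataset is meant to expose: a negated sentence shares most of its words with the anchor and can end up with a near-identical embedding.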
Code & Models
- 🤗 jinaai/jina-embedding-s-en-v1 · model · 1.2k dl · ♡ 26
- 🤗 jinaai/jina-embedding-b-en-v1 · model · 4.2k dl · ♡ 8
- 🤗 jinaai/jina-embedding-l-en-v1 · model · 194 dl · ♡ 25
- 🤗 jinaai/jina-embedding-t-en-v1 · model · 181 dl · ♡ 30
- 🤗 michaelfeil/ct2fast-jina-embedding-l-en-v1 · model · 9 dl
- 🤗 michaelfeil/ct2fast-jina-embedding-t-en-v1 · model · 6 dl
- 🤗 michaelfeil/ct2fast-jina-embedding-s-en-v1 · model · 6 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Mining Algorithms and Applications
