M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu

TL;DR
M3-Embedding is a multilingual text embedding model that supports dense, sparse, and multi-vector retrieval over inputs ranging from short sentences to documents of up to 8,192 tokens. It is trained with a novel self-knowledge distillation method and achieves state-of-the-art results on multilingual retrieval benchmarks.
Contribution
The paper introduces M3-Embedding, a multilingual, multi-functional, and multi-granular embedding model, together with a self-knowledge distillation approach for training it.
Findings
Achieves state-of-the-art results on multilingual retrieval benchmarks.
Supports over 100 languages with high retrieval accuracy.
Handles inputs from short sentences to long documents up to 8,192 tokens.
Abstract
In this paper, we introduce a new embedding model called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It provides a uniform support for the semantic retrieval of more than 100 working languages. It can simultaneously accomplish the three common retrieval functionalities: dense retrieval, multi-vector retrieval, and sparse retrieval. Besides, it is also capable of processing inputs of different granularities, spanning from short sentences to long documents of up to 8,192 tokens. The effective training of M3-Embedding presents a series of technical contributions. Notably, we propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the…
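To make the abstract's two central ideas concrete, here is a minimal pure-Python sketch of the three retrieval scores and of using their combination as a teacher signal. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy vectors, the unweighted sum used to integrate the scores, and the softmax cross-entropy form of the distillation loss are all choices made here for clarity and are not specified in the abstract.

```python
import math

def dense_score(q_vec, d_vec):
    """Dense retrieval: inner product of two single-vector embeddings."""
    return sum(a * b for a, b in zip(q_vec, d_vec))

def sparse_score(q_weights, d_weights):
    """Sparse retrieval: sum of weight products over terms shared by both texts."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def multi_vector_score(q_vecs, d_vecs):
    """Multi-vector retrieval (late interaction): each query token vector takes
    its best-matching document token vector; the maxima are averaged."""
    return sum(max(dense_score(q, d) for d in d_vecs) for q in q_vecs) / len(q_vecs)

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_scores, student_scores):
    """Cross-entropy between softmax(teacher) and softmax(student) over candidates."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Toy data (hypothetical): one query and two candidate documents.
query = {
    "dense": [1.0, 0.0],
    "sparse": {"cat": 0.8},
    "multi": [[1.0, 0.0], [0.0, 1.0]],
}
docs = [
    {"dense": [0.6, 0.8], "sparse": {"cat": 0.5, "sat": 0.2},
     "multi": [[1.0, 0.0], [0.6, 0.8]]},
    {"dense": [0.0, 1.0], "sparse": {"dog": 0.9},
     "multi": [[0.0, 1.0]]},
]

s_dense = [dense_score(query["dense"], d["dense"]) for d in docs]
s_sparse = [sparse_score(query["sparse"], d["sparse"]) for d in docs]
s_multi = [multi_vector_score(query["multi"], d["multi"]) for d in docs]

# Assumption: integrate the three scores with an unweighted sum.
s_teacher = [a + b + c for a, b, c in zip(s_dense, s_sparse, s_multi)]

# Each retrieval function acts as a student of the integrated teacher signal.
losses = {
    "dense": distillation_loss(s_teacher, s_dense),
    "sparse": distillation_loss(s_teacher, s_sparse),
    "multi": distillation_loss(s_teacher, s_multi),
}

if __name__ == "__main__":
    print("teacher scores:", s_teacher)
    print("distillation losses:", losses)
```

In this sketch the integrated score ranks the first document above the second, and each individual scoring function is penalized according to how far its score distribution over the candidates departs from the teacher's. A real training setup would compute these scores from model outputs and backpropagate through the student scores.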
Code & Models
- BAAI/bge-m3 (model): 14.5M downloads, 2871 likes
- BAAI/bge-reranker-v2-m3 (model): 5.6M downloads, 937 likes
- Alibaba-NLP/gte-multilingual-base (model): 914k downloads, 353 likes
- avemio/German-RAG-BGE-M3-TRIPLES-HESSIAN-AI (model): 17 downloads, 1 like
- avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI (model): 205 downloads, 2 likes
- BAAI/bge-m3-unsupervised (model): 6.0k downloads, 18 likes
- BAAI/bge-m3-retromae (model): 1.6k downloads, 18 likes
- Ruddy0201/YOUR_MODEL_NAME (model): 4 downloads
- dabitbol/bge-m3-sparse-elastic (model): 2 downloads, 2 likes
- Bylaw/BAAI-bge-m3 (model): 3 downloads
Taxonomy
Topics: Natural Language Processing Techniques
