KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model
Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Haofen Wang, Jun Yu, Min Zhang

TL;DR
KaLM-Embedding is a multilingual embedding model trained on cleaner, more diverse data using persona-based synthetic data, ranking consistency filtering, and semi-homogeneous task batch sampling. It outperforms models of comparable size on the MTEB benchmark.
Contribution
Introduces KaLM-Embedding, a multilingual embedding model built on improved training data and training techniques, with Qwen2-0.5B as the pre-trained model in place of a traditional BERT-like architecture.
Findings
Outperforms similar-sized models on the MTEB benchmark
Leverages cleaner, more diverse training data for better embeddings
Sets new performance standards for multilingual embedding models
Abstract
As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for…
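The abstract names ranking consistency filtering and semi-homogeneous task batch sampling without giving their details, so the following is only a minimal sketch of one plausible reading of each, not the authors' implementation. It assumes that filtering keeps a (query, positive) pair only when a reference embedding ranks the positive within the top `top_k` of a candidate pool, and that batch sampling draws most batches from a single task while drawing a fraction `mix_ratio` of batches from the pooled data of all tasks. The names `positive_rank`, `keep_sample`, `sample_batches`, `top_k`, and `mix_ratio` are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def positive_rank(query: Sequence[float], positive: Sequence[float],
                  candidates: List[Sequence[float]]) -> int:
    """1-based rank of the positive among the candidates, by similarity to the query."""
    pos_score = cosine(query, positive)
    return 1 + sum(1 for c in candidates if cosine(query, c) > pos_score)


def keep_sample(query: Sequence[float], positive: Sequence[float],
                candidates: List[Sequence[float]], top_k: int = 2) -> bool:
    """Assumed filtering rule: keep the pair only if the positive ranks within top_k."""
    return positive_rank(query, positive, candidates) <= top_k


def sample_batches(task_to_ids: Dict[str, List[int]], batch_size: int,
                   n_batches: int, mix_ratio: float, seed: int = 0) -> List[List[int]]:
    """Assumed sampling rule: with probability mix_ratio a batch is drawn from the
    pooled data of all tasks; otherwise it is drawn from one randomly chosen task."""
    rng = random.Random(seed)
    pooled = [i for ids in task_to_ids.values() for i in ids]
    tasks = sorted(task_to_ids)
    batches = []
    for _ in range(n_batches):
        if rng.random() < mix_ratio:
            pool = pooled
        else:
            pool = task_to_ids[rng.choice(tasks)]
        # Sample without replacement; shrink the batch if the pool is small.
        batches.append(rng.sample(pool, min(batch_size, len(pool))))
    return batches
```

Under these assumptions, `mix_ratio=0.0` gives fully homogeneous batches (every batch comes from one task) and `mix_ratio=1.0` gives fully mixed batches; intermediate values correspond to the "semi-homogeneous" regime. In practice the vectors passed to `keep_sample` would come from an existing embedding model rather than being written by hand.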
Code & Models
- 🤗 tencent/KaLM-Embedding-Gemma3-12B-2511 (model) · 98k dl · ♡ 88
- 🤗 HIT-TMG/KaLM-embedding-multilingual-mini-v1 (model) · 6.7k dl · ♡ 31
- 🤗 HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1 (model) · 331 dl · ♡ 33
- 🤗 HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5 (model) · 1.9k dl · ♡ 64
- 🤗 MesTruck/KaLM-embedding-multilingual-mini-instruct-v1-GGUF (model) · 5 dl
- 🤗 HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2 (model) · 450 dl · ♡ 33
- 🤗 KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5 (model) · 9.7k dl · ♡ 58
- 🤗 thomasht86/KaLM-embedding-multilingual-mini-instruct-v2.5-ONNX (model) · 3 dl · ♡ 2
- 🤗 Gidigi/gidigi_79c07173_0003 (model) · 8 dl
Taxonomy
Topics: Topic Modeling
Methods: ADaptive gradient method with the OPTimal convergence rate
