KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Xinping Zhao; Xinshuo Hu; Zifei Shan; Shouzheng Huang; Yao Zhou; Xin Zhang; Zetian Sun; Zhenyu Liu; Dongfang Li; Xinyuan Wei; Youcheng Pan; Yang Xiang; Meishan Zhang; Haofen Wang; Jun Yu; Baotian Hu; Min Zhang

arXiv:2506.20923·cs.CL·October 15, 2025

TL;DR

KaLM-Embedding-V2 introduces a series of compact, versatile embedding models that leverage advanced training techniques and high-quality data to achieve state-of-the-art performance, rivaling much larger models.

Contribution

The paper presents novel training strategies and data curation methods that significantly enhance the performance of small-sized embedding models.

Findings

01

Achieves state-of-the-art results on the Massive Text Embedding Benchmark.

02

Outperforms models of comparable size and rivals models 3-26x larger.

03

Demonstrates the effectiveness of multi-stage training and high-quality data.

Abstract

Recent advancements in Large Language Models (LLMs)-based text embedding models primarily focus on data scaling or synthesis, yet limited exploration of training techniques and data quality, thereby constraining performance. In this work, we propose KaLM-Embedding-V2, a series of versatile and compact embedding models, systematically incentivizing advanced embedding capability in LLMs by superior training techniques and high-quality data. For model architecture, we implement the models on a 0.5B compact size with simple mean-pooling to produce fixed-length embeddings and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning with supervised high-quality datasets, and contrastive distillation with…
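The mean-pooling step named in the abstract can be illustrated with a minimal sketch. It assumes the model emits one hidden-state vector per token plus a 0/1 attention mask marking padding; the function name `mean_pool` and the toy inputs are hypothetical and are not taken from the paper or its repository.

```python
from typing import List


def mean_pool(token_states: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the hidden states of non-padding tokens into one fixed-length vector."""
    if len(token_states) != len(attention_mask):
        raise ValueError("token_states and attention_mask must have the same length")
    # Keep only the positions the mask marks as real tokens (mask value 1).
    kept = [state for state, m in zip(token_states, attention_mask) if m]
    if not kept:
        raise ValueError("at least one token must be unmasked")
    dim = len(kept[0])
    # Average each hidden dimension over the kept tokens.
    return [sum(state[d] for state in kept) / len(kept) for d in range(dim)]


# Toy input (hypothetical): three tokens with 2-dimensional states; the last is padding.
states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
embedding = mean_pool(states, mask)
print(embedding)  # [2.0, 3.0]
```

The output length depends only on the hidden size, not on the number of tokens, which is what makes the embedding fixed-length. The padded position is excluded so that it does not distort the average.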

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* The final model performs well on benchmarks with fewer parameters.
* Focal-style reweighting, online hard-negative mixing, and contrastive distillation are novel in the text embedding space.
* Very detailed and sound experimental results leave few questions unanswered.

Weaknesses

* The most impactful technique, focal-style reweighting, is already commonly used in machine learning.
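For readers unfamiliar with the technique the reviewer refers to, the following is a rough sketch of focal-style reweighting applied to a contrastive (InfoNCE-style) loss. This page does not give the paper's exact formulation, so the sketch assumes the standard focal-loss weight `(1 - p) ** gamma`, where `p` is the softmax probability assigned to the positive pair. The function name and the `temperature` and `gamma` defaults are hypothetical.

```python
import math
from typing import List


def focal_info_nce(pos_score: float, neg_scores: List[float],
                   temperature: float = 0.05, gamma: float = 0.5) -> float:
    """InfoNCE loss for one query, down-weighted when the positive is already easy.

    Assumption: the weight follows the usual focal-loss form (1 - p) ** gamma.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    log_p = logits[0] - log_z          # log-probability of the positive pair
    p = math.exp(log_p)
    weight = max(0.0, 1.0 - p) ** gamma  # near 0 for easy examples, near 1 for hard ones
    return -weight * log_p


# Hypothetical similarity scores: an easy example and a hard one.
easy = focal_info_nce(0.9, [0.1, 0.0])
hard = focal_info_nce(0.5, [0.45, 0.4])
print(easy < hard)  # True
```

Setting `gamma` to 0 makes the weight 1 and recovers the plain InfoNCE loss, so the exponent controls how strongly training focuses on hard examples.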

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Methodical system design: the detailed explanations of architecture, data, and training all contribute.
- Ablations support most claims (focal loss, hard-negative mixing, distillation components).
- Description of the training data and the pipeline for its creation.

Weaknesses

- The distillation stage is not fully justified, as the paper does not explore whether applying distillation earlier or interleaving phases could be equally or more effective.
- Instruction dependence remains unclear, since the paper does not evaluate performance without instructions or under instruction mismatch.
- Causal-vs-bidirectional attention choice insufficiently validated, lacking direct comparison with strong causal-mask baselines from the same model family.
- No comparison of performa

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1 - New embedding models that surpass some SoTA LLM-based and encoder-based models with very few parameters.
2 - Authors give extensive details about the training recipe and the datasets used for it.
3 - The model is compared to multiple existing models and evaluated on both English and Chinese.
4 - Good open-source contribution if the training code is released, as it goes beyond most recent models that are open-weights only.

Weaknesses

1 - The originality of this work is not very clear, as it combines existing training recipes that were used for training embedding models (see NV-Embed models and Qwen3 embedding models).
2 - Table 2 is clearly missing SoTA models on MTEB(eng, v1). Checking the MTEB leaderboard (on MTEB(eng, v2)), I see for example that Qwen3-4B and 8B achieve good performance on the benchmark but are not listed in the >1B models list. They outperform KaLM-v2 but they are larger; it would be great to list t

Code & Models

Repositories

HITsz-TMG/KaLM-Embedding
PyTorch · Official

Taxonomy

Topics: Artificial Intelligence in Healthcare

Methods: ADaptive gradient method with the OPTimal convergence rate · ALIGN