# QZhou-Embedding Technical Report

**Authors:** Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu

arXiv: 2508.21632 · 2025-09-01

## TL;DR

QZhou-Embedding is a general-purpose text embedding model built on Qwen2.5-7B-Instruct. It combines LLM-based data synthesis with a two-stage training strategy, and the authors report that it ranked first on the MTEB and CMTEB leaderboards as of August 27, 2025.

## Contribution

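The paper introduces a unified multi-task framework that pairs a data transformation scheme, which allows more diverse textual datasets to be used for training, with task-specific training strategies. Training proceeds in two stages: retrieval-focused pretraining followed by full-task fine-tuning.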

## Key findings

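- Ranks first on the MTEB and CMTEB leaderboards (as of August 27, 2025), with state-of-the-art results on tasks such as reranking and clustering
- Finds that higher-quality, more diverse training data is crucial for retrieval performance
- Uses LLM-generated data (paraphrasing, augmentation, hard negative generation) to improve training data quality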

## Abstract

We present QZhou-Embedding, a general-purpose contextual text embedding model with exceptional text representation capabilities. Built upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies. The data transformation scheme enables the incorporation of more diverse textual training datasets, while the task-specific training strategies enhance model learning efficiency. We developed a data synthesis pipeline leveraging an LLM API, incorporating techniques such as paraphrasing, augmentation, and hard negative example generation to improve the semantic richness and sample difficulty of the training set. Additionally, we employ a two-stage training strategy, comprising initial retrieval-focused pretraining followed by full-task fine-tuning, enabling the embedding model to extend its capabilities based on robust retrieval performance. Our model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (August 27, 2025), and simultaneously achieves state-of-the-art performance on tasks including reranking and clustering. Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance, and that leveraging LLMs' generative capabilities can further optimize data quality for embedding model breakthroughs. Our model weights are released on HuggingFace under the Apache 2.0 license. For reproducibility, we provide evaluation code and instructions on GitHub.
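
To illustrate why hard negative examples raise "sample difficulty", here is a minimal sketch in plain Python. It assumes an InfoNCE-style contrastive loss over cosine similarities, which is a common choice for training embedding models; the abstract does not state which loss QZhou-Embedding uses, so the loss, the temperature value, and the toy 2-D vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: negative log-softmax of the positive's score.

    Assumption: this is a generic contrastive objective, not necessarily
    the one used in the paper. The temperature is a hypothetical value.
    """
    scores = [cosine(query, positive)] + [cosine(query, n) for n in negatives]
    logits = [s / temperature for s in scores]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy 2-D "embeddings" (hypothetical values, not model outputs).
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]  # points away from the query
hard_negative = [0.8, 0.6]   # close to the query, but not the right answer

loss_easy = info_nce_loss(query, positive, [easy_negative])
loss_hard = info_nce_loss(query, positive, [hard_negative])
```

Under these assumptions, `loss_hard` is larger than `loss_easy`: a negative that sits close to the query contributes more to the softmax denominator, so it yields a stronger training signal than one the model already separates easily.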

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21632/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21632/full.md

## References

76 references — full list in the complete paper: https://tomesphere.com/paper/2508.21632/full.md

---
Source: https://tomesphere.com/paper/2508.21632