Conan-embedding: General Text Embedding with More and Better Negative Samples
Shiyu Li, Yang Tang, Shizhe Chen, Xi Chen

TL;DR
This paper introduces Conan-embedding, a text embedding model that combines dynamic hard negative mining, cross-GPU balancing, and LLM-generated prompt-response pairs to improve embedding quality, ranking first on the Chinese Massive Text Embedding Benchmark.
Contribution
The paper presents a new embedding model with a dynamic negative mining strategy, cross-GPU loss balancing, and the use of LLM-generated data, advancing contrastive learning techniques.
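To make the idea of dynamic negative mining concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `refresh_hard_negative`, the `min_score` threshold, and the replacement rule are all hypothetical, chosen only to illustrate re-checking a negative during training instead of fixing it once in preprocessing.

```python
import numpy as np

def refresh_hard_negative(query_emb, corpus_emb, positive_idx, negative_idx,
                          min_score=0.5):
    """Return the index of the negative to use for this query.

    Hypothetical rule (an assumption, not taken from the paper): if the
    current negative still scores at least `min_score` against the query
    under the current model, keep it; otherwise re-mine the highest-scoring
    passage that is not the positive.
    Embeddings are assumed to be L2-normalized, so the dot product is the
    cosine similarity.
    """
    scores = (corpus_emb @ query_emb).astype(float)
    if scores[negative_idx] >= min_score:
        return negative_idx            # still a hard negative, keep it
    candidates = scores.copy()
    candidates[positive_idx] = -np.inf  # never select the positive
    return int(np.argmax(candidates))

# Toy corpus of unit vectors; passage 0 is the positive for the query.
corpus = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
new_negative = refresh_hard_negative(query, corpus, positive_idx=0, negative_idx=3)
```

In a training loop, a check like this would run periodically with embeddings from the current checkpoint, so the negatives track what the model currently finds difficult. A real pipeline would also need to guard against false negatives, which this sketch ignores.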
Findings
Achieves first place on the Chinese Massive Text Embedding Benchmark
Effectively utilizes more and higher-quality negative samples
Improves embedding model performance through dynamic hard negative mining, cross-GPU balancing, and LLM-generated training data
Abstract
With the growing popularity of retrieval-augmented generation (RAG), the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained with a contrastive loss, in which negative examples are a key component. Previous work has proposed various hard negative mining strategies, but these are typically applied as preprocessing steps. In this paper, we propose the Conan-embedding model, which maximizes the utilization of more and higher-quality negative examples. First, since the model's ability to handle preprocessed negative examples evolves during training, we propose a dynamic hard negative mining method that exposes the model to more challenging negative examples throughout the training process. Second, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. Therefore, we use a Cross-GPU balancing…
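The role of negatives in a contrastive loss can be illustrated with a short sketch. The code below uses the standard InfoNCE formulation for a single query; this is an assumption for illustration, since the abstract does not give the exact loss, and the function name `info_nce_loss` and the temperature value are hypothetical.

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query (standard form, assumed for illustration).

    query:     (d,)   query embedding
    positive:  (d,)   embedding of the matching passage
    negatives: (n, d) embeddings of the negative passages
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = normalize(np.asarray(query, dtype=float))
    candidates = normalize(np.vstack([positive, negatives]).astype(float))
    logits = candidates @ q / temperature   # positive sits at index 0
    logits = logits - logits.max()          # numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_positive

query = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
easy_negatives = np.array([[-1.0, 0.0]])   # far from the query
hard_negatives = np.array([[0.9, 0.1]])    # close to the query

loss_easy = info_nce_loss(query, positive, easy_negatives)
loss_hard = info_nce_loss(query, positive, hard_negatives)
```

With the easy negative the loss is essentially zero, so the example contributes almost no gradient; the hard negative yields a noticeably larger loss. Each added negative also adds a term to the denominator, which is why larger pools of negatives, bounded by GPU memory, make the objective more demanding.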
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Authorship Attribution and Profiling
Methods: Attention Is All You Need · Adam · Layer Normalization · Weight Decay · Dense Connections · WordPiece · Attention Dropout · Linear Warmup With Linear Decay · Byte Pair Encoding
