Conan-embedding: General Text Embedding with More and Better Negative Samples

Shiyu Li; Yang Tang; Shizhe Chen; Xi Chen

arXiv:2408.15710·cs.CL·August 30, 2024

TL;DR

This paper introduces Conan-embedding, a text embedding model that uses dynamic hard negative mining, cross-GPU balancing, and LLM-generated prompt-response pairs to improve embedding quality, ranking first on the Chinese Massive Text Embedding Benchmark.

Contribution

The paper presents a new embedding model that combines three contrastive learning techniques: a dynamic hard negative mining strategy, cross-GPU loss balancing, and training on LLM-generated data.

Findings

01. Achieves first place on the Chinese Massive Text Embedding Benchmark
02. Effectively utilizes more and higher-quality negative samples
03. Improves embedding model performance through novel training strategies

Abstract

With the growing popularity of RAG, the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained through contrastive loss learning, with negative examples being a key component. Previous work has proposed various hard negative mining strategies, but these strategies are typically employed as preprocessing steps. In this paper, we propose the Conan-embedding model, which maximizes the utilization of more and higher-quality negative examples. First, since the model's ability to handle preprocessed negative examples evolves during training, we propose a dynamic hard negative mining method to expose the model to more challenging negative examples throughout the training process. Second, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. Therefore, we use a Cross-GPU balancing…
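The role of negative examples in contrastive training can be illustrated with a minimal sketch. It is not the authors' implementation: the truncated abstract does not specify the loss or the mining rule, so the sketch assumes a standard InfoNCE-style loss over cosine similarities and a simple "most similar non-positive candidates" rule for picking hard negatives. The function names `cosine`, `info_nce_loss`, and `mine_hard_negatives`, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss (assumed form): -log softmax of the positive logit.

    The positive is placed at index 0; every negative adds one more logit
    to the denominator, so harder negatives raise the loss.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


def mine_hard_negatives(query, candidates, positive_index, k=2):
    """Return indices of the k candidates most similar to the query,
    excluding the positive. Re-running this with the *current* embeddings
    during training is one (assumed) way to keep negatives challenging.
    """
    scored = [
        (cosine(query, c), i)
        for i, c in enumerate(candidates)
        if i != positive_index
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-D "embeddings" (illustrative values only).
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]

hard_idx = mine_hard_negatives(query, candidates, positive_index=0, k=2)
hard_loss = info_nce_loss(query, candidates[0], [candidates[i] for i in hard_idx])
easy_loss = info_nce_loss(query, candidates[0], [candidates[2], candidates[3]])
print(hard_idx, hard_loss > easy_loss)
```

In this toy setup the mined negatives produce a larger loss than the two least similar candidates, which is the intuition behind preferring harder negatives. A static preprocessing step would run the mining once with an earlier model, whereas a dynamic scheme would repeat it as the embeddings change.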

Taxonomy

Topics

Topic Modeling · Advanced Text Analysis Techniques · Authorship Attribution and Profiling

Methods

Attention Is All You Need · Adam · Layer Normalization · Weight Decay · Dense Connections · WordPiece · Attention Dropout · Linear Warmup With Linear Decay · Byte Pair Encoding