Embedding And Clustering Your Data Can Improve Contrastive Pretraining
Luke Merrick

TL;DR
This paper stratifies contrastive pretraining data by k-means clusters of pretrained text embeddings, going beyond per-source minibatches, and reports a notable increase in NDCG@10 on MSMARCO passage retrieval.
Contribution
It introduces a clustering-based data organization approach for contrastive pretraining, extending beyond source-based stratification to semantic clusters within sources.
Findings
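Notable increase in NDCG@10 when pretraining a BERT-based text embedding model on MSMARCO query-passage pairs
Splitting each source into semantic clusters extends the benefit previously reported for single-source minibatches
Conceptually connects clustering to Topic Aware Sampling (TAS) in TAS-B and to nearest-neighbor hard-negative mining in ANCE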
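For readers unfamiliar with the metric named in the findings, here is a minimal sketch of NDCG@10 for a single query. It assumes linear gain (the relevance label itself) and assumes the input list holds the relevance labels of all judged candidates in ranked order, so that the ideal ordering can be derived by sorting. The function names are illustrative and are not taken from the paper.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant candidates for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A benchmark score is then the mean of `ndcg_at_k` over all evaluation queries.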
Abstract
Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified…
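The stratification idea in the abstract can be illustrated with a rough sketch: group examples by source, cluster each source's embeddings with k-means, and draw every minibatch from a single cluster. This is not the author's implementation. It assumes the embeddings were already computed by some pretrained model, uses a small NumPy version of Lloyd's algorithm as a stand-in for a production k-means, and drops leftover examples that do not fill a batch. All function and parameter names are hypothetical.

```python
import numpy as np


def kmeans_labels(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the last centroid update.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def cluster_stratified_batches(embeddings, sources, k, batch_size, seed=0):
    """Return index batches where each batch comes from one cluster of one source."""
    embeddings = np.asarray(embeddings, dtype=float)
    sources = np.asarray(sources)
    rng = np.random.default_rng(seed)
    batches = []
    for source in np.unique(sources):
        idx = np.flatnonzero(sources == source)
        labels = kmeans_labels(embeddings[idx], min(k, len(idx)), seed=seed)
        for cluster in np.unique(labels):
            members = rng.permutation(idx[labels == cluster])
            # Only full batches are kept (an assumption of this sketch).
            for start in range(0, len(members) - batch_size + 1, batch_size):
                batches.append(members[start:start + batch_size])
    # Shuffle batch order so training does not visit clusters sequentially.
    order = rng.permutation(len(batches))
    return [batches[i] for i in order]
```

With in-batch negatives, every negative in such a batch shares a source and a semantic cluster with the positive pair, which is the intuition behind the connection to topic-aware sampling and hard-negative mining mentioned in the abstract.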
Code & Models
- 🤗 Snowflake/snowflake-arctic-embed-m (model) · 385k dl · ♡ 164
- 🤗 Snowflake/snowflake-arctic-embed-m-long (model) · 33k dl · ♡ 38
- 🤗 Snowflake/snowflake-arctic-embed-s (model) · 50k dl · ♡ 24
- 🤗 Snowflake/snowflake-arctic-embed-xs (model) · 211k dl · ♡ 39
- 🤗 Snowflake/snowflake-arctic-embed-l (model) · 61k dl · ♡ 100
- 🤗 Snowflake/snowflake-arctic-embed-m-v1.5 (model) · 130k dl · ♡ 70
- 🤗 dragonkue/snowflake-arctic-embed-l-v2.0-ko (model) · 19k dl · ♡ 45
Taxonomy
Topics: Innovative Teaching Methods · Science Education and Pedagogy
Methods: Attentive Walk-Aggregating Graph Neural Network · k-Means Clustering
