Analyzing Similarity Metrics for Data Selection for Language Model Pretraining

Dylan Sam; Ayan Chakrabarti; Afshin Rostamizadeh; Srikumar Ramalingam; Gui Citovsky; Sanjiv Kumar

arXiv:2502.02494 · cs.LG · October 22, 2025

Open Access

TL;DR

This paper evaluates the effectiveness of existing similarity metrics for selecting training data in language model pretraining, revealing that standard off-the-shelf embeddings are often inadequate for this purpose.

Contribution

The paper introduces a framework for assessing and comparing how suitable similarity metrics are for pretraining data curation in language models.

Findings

01. Standard off-the-shelf embeddings underperform for data curation in pretraining.

02. Simple embeddings from models trained on the same corpus can be more effective.

03. The framework helps guide the development of better similarity metrics for pretraining datasets.

Abstract

Measuring similarity between training examples is critical for curating high-quality and diverse pretraining datasets for language models. However, similarity is typically computed with a generic off-the-shelf embedding model that has been trained for tasks such as retrieval. Whether these embedding-based similarity metrics are well-suited for pretraining data selection remains largely unexplored. In this paper, we propose a new framework to assess the suitability of a similarity metric specifically for data curation in language model pretraining applications. Our framework's first evaluation criterion captures how well distances reflect generalization in pretraining loss between different training examples. Next, we use each embedding model to guide a standard diversity-based data curation algorithm and measure its utility by pretraining a language model on the selected data and…
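
To make the idea of embedding-guided, diversity-based curation concrete, here is a minimal sketch. The abstract does not name the curation algorithm the authors use, so this example assumes a greedy farthest-point (k-center) rule over cosine distances as a stand-in. The function names `cosine_distance_matrix` and `greedy_diverse_subset` are hypothetical, and the toy vectors stand in for the output of whichever embedding model is being evaluated.

```python
import numpy as np


def cosine_distance_matrix(embeddings):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return 1.0 - unit @ unit.T


def greedy_diverse_subset(embeddings, k, start=0):
    """Farthest-point (k-center greedy) selection.

    Assumption: this is an illustrative diversity heuristic, not necessarily
    the algorithm used in the paper. Each step adds the example whose
    distance to its nearest already-selected example is largest.
    """
    dist = cosine_distance_matrix(embeddings)
    selected = [start]
    min_dist = dist[start].copy()
    min_dist[start] = -np.inf  # never re-select a chosen example
    while len(selected) < min(k, len(embeddings)):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
        min_dist[nxt] = -np.inf
    return selected


# Toy embeddings: two near-duplicate pairs plus one outlier.
toy_embeddings = np.array([
    [1.0, 0.0],
    [1.0, 0.01],
    [0.0, 1.0],
    [0.01, 1.0],
    [-1.0, 0.0],
])
chosen = greedy_diverse_subset(toy_embeddings, k=3)
```

On the toy data, the selection skips the near-duplicate of each already-chosen example. Which examples count as near-duplicates depends entirely on the embedding model that produced the vectors, and that dependence is what the paper's framework sets out to evaluate.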

Taxonomy

Topics: Natural Language Processing Techniques