Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets

Sultan Alrashed; Francesco Orabona

arXiv:2512.18834·cs.CL·January 30, 2026

Open Access · 4 Models · 5 Datasets

TL;DR

This paper introduces MixMinMatch, a method that uses cross-source agreement, detected during MinHash deduplication, as a quality signal for multilingual web datasets. It increases unique tokens and quality with no additional computation beyond standard deduplication.

Contribution

It proposes a novel dataset creation technique that uses existing deduplication signals to identify high-quality, diverse multilingual web data from multiple sources.

Findings

01

Up to 4x more unique tokens in datasets.

02

Improved quality over single-source baselines.

03

Enhanced multilingual datasets for Arabic, Turkish, and Hindi.

Abstract

Multilingual data from the web is essential for LLM pretraining. Yet scraping it is expensive, and research groups repeatedly crawl the same content. For example, we found that over 40% of tokens across major Arabic web corpora are duplicated between sources. In this work, we propose to use this wasteful redundancy as a quality signal to create high-quality pretraining datasets. Our key insight is that cross-source agreement functions as a free, model-free quality filter: content retained by multiple independent pipelines is more likely to represent high-quality text. Crucially, this signal requires no additional computation beyond standard deduplication, which is already performed at scale when pretraining language models. We therefore propose MixMinMatch, a method that combines multiple existing web corpora, performs cross-dataset MinHash deduplication, and identifies documents…
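
The idea of reading cross-source agreement off a MinHash deduplication pass can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`shingles`, `minhash`, `cluster_sources`), the word 3-gram shingles, the 64 hash functions, the 16-band LSH layout, and the "seen in at least two sources" threshold are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64               # assumed signature length
BANDS = 16                    # assumed LSH banding: 16 bands x 4 rows
ROWS = NUM_HASHES // BANDS


def shingles(text, n=3):
    """Set of word n-grams for a document (n=3 is an assumption)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(text):
    """MinHash signature: the minimum salted hash per hash function."""
    grams = shingles(text)
    signature = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return tuple(signature)


def cluster_sources(docs):
    """Group near-duplicates and record which sources contain each group.

    docs: list of (source_name, text) pairs.
    Returns a list of (representative_text, set_of_source_names).
    """
    parent = list(range(len(docs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    buckets = {}
    for idx, (_, text) in enumerate(docs):
        sig = minhash(text)
        for b in range(BANDS):
            key = (b, sig[b * ROWS:(b + 1) * ROWS])
            if key in buckets:
                # Same band bucket: treat as near-duplicates and merge.
                parent[find(idx)] = find(buckets[key])
            else:
                buckets[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(docs)):
        clusters[find(idx)].append(idx)

    # Keep the first-seen document per cluster, plus its source set.
    return [
        (docs[min(members)][1], {docs[i][0] for i in members})
        for members in clusters.values()
    ]


if __name__ == "__main__":
    corpus = [
        ("corpus_a", "the quick brown fox jumps over the lazy dog near the river bank"),
        ("corpus_b", "the quick brown fox jumps over the lazy dog near the river bank"),
        ("corpus_a", "completely unrelated sentence about cooking rice with saffron and almonds"),
    ]
    for text, sources in cluster_sources(corpus):
        label = "agreed" if len(sources) >= 2 else "single-source"
        print(label, sorted(sources), text[:40])
```

The source set falls out of the same clustering pass that removes duplicates, which is the sense in which the signal is free. At real scale the bucketing would run in a distributed deduplication framework rather than an in-memory dictionary.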

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Data Quality and Management