Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets
Sultan Alrashed, Francesco Orabona

TL;DR
This paper introduces MixMinMatch, a method that uses cross-source agreement, detected via MinHash deduplication, to build better multilingual web datasets, yielding more unique tokens and higher quality with no computation beyond standard deduplication.
Contribution
It proposes a novel dataset creation technique that uses existing deduplication signals to identify high-quality, diverse multilingual web data from multiple sources.
Findings
Up to 4x more unique tokens in datasets.
Improved quality over single-source baselines.
Enhanced multilingual datasets for Arabic, Turkish, and Hindi.
Abstract
Multilingual data from the web is essential for LLM pretraining. Yet, scraping it is expensive, and research groups repeatedly crawl the same content. For example, we found that over 40% of tokens across major Arabic web corpora are duplicated between sources. In this work, we propose to use this wasteful redundancy as a quality signal to create high-quality pretraining datasets. Our key insight is that cross-source agreement functions as a free, model-free quality filter: content retained by multiple independent pipelines is more likely to represent high-quality text. Crucially, this signal requires no additional computation beyond standard deduplication, which is already performed at scale when pretraining language models. So, we propose MixMinMatch, a method that combines multiple existing web corpora, performs cross-dataset MinHash deduplication, and identifies documents…
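The abstract is cut off before the method is fully described, so the following is only a minimal sketch of how cross-source agreement might be read off a MinHash deduplication pass, not the authors' implementation. It assumes word 3-gram shingles, 64 hash functions, a 0.7 similarity threshold, and a brute-force pairwise comparison; all of these values, and the function names `minhash`, `estimate_jaccard`, and `cross_source_agreement`, are hypothetical choices made for illustration. A production pipeline would typically use locality-sensitive hashing instead of comparing every pair, and whitespace tokenization is a simplification for languages such as Arabic, Turkish, or Hindi.

```python
import hashlib
from itertools import combinations

NUM_PERM = 64     # hypothetical signature length
SHINGLE_SIZE = 3  # hypothetical word n-gram size
THRESHOLD = 0.7   # hypothetical near-duplicate threshold


def shingles(text, n=SHINGLE_SIZE):
    """Return the set of word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(text, num_perm=NUM_PERM):
    """MinHash signature: for each seeded hash, keep the minimum over shingles."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in doc_shingles
        ))
    return tuple(signature)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def cross_source_agreement(corpora, threshold=THRESHOLD):
    """Cluster near-duplicates across sources and count the sources per cluster.

    corpora: dict mapping a source name to a list of document strings.
    Returns a list of (representative_text, number_of_distinct_sources).
    """
    docs = [(src, text) for src, texts in corpora.items() for text in texts]
    sigs = [minhash(text) for _, text in docs]
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force O(n^2) comparison; fine for a toy example only.
    for i, j in combinations(range(len(docs)), 2):
        if estimate_jaccard(sigs[i], sigs[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for idx in range(len(docs)):
        clusters.setdefault(find(idx), []).append(idx)

    results = []
    for members in clusters.values():
        sources = {docs[m][0] for m in members}
        # Keep one copy per cluster; the source count is the agreement signal.
        results.append((docs[members[0]][1], len(sources)))
    return results
```

In this sketch, deduplication and the agreement signal come from the same clustering pass: each cluster contributes one document, and the number of distinct sources that contained it could then be used to select documents retained by multiple independent pipelines.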
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Data Quality and Management
