The MiniPile Challenge for Data-Efficient Language Models
Jean Kaddour

TL;DR
This paper introduces The MiniPile Challenge, built around MiniPile, a curated 6GB subset of the much larger Pile corpus. MiniPile is designed for efficient language model pre-training on a small dataset, and models trained on it remain competitive despite seeing significantly less data.
Contribution
It presents a simple data filtering method to create MiniPile, enabling effective language model pre-training on a small, diverse dataset with minimal performance loss.
Findings
MiniPile enables pre-training with only 6GB of data.
Models trained on MiniPile show less than 2.6% performance drop.
MiniPile is publicly available for research use.
Abstract
The ever-growing diversity of pre-training text corpora has equipped language models with generalization capabilities across various downstream tasks. However, such diverse datasets are often too large for academic budgets; hence, most research on Transformer architectures, training procedures, optimizers, etc. gets conducted on smaller, homogeneous datasets. To this end, we present The MiniPile Challenge, where one pre-trains a language model on a diverse text corpus containing at most 1M documents. MiniPile is a 6GB subset of the deduplicated 825GB The Pile corpus. To curate MiniPile, we perform a simple, three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using k-means, and (3) filter out low-quality clusters. To verify MiniPile's suitability for language model pre-training, we use it to pre-train a BERT and T5…
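The three-step curation process named in the abstract (embed, cluster, filter) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `embed` function is a toy hashed bag-of-words stand-in for a neural embedding model, the k-means routine is a plain Lloyd's iteration on a handful of made-up documents, and the set of excluded clusters is picked by hand, since the abstract does not say how cluster quality is judged. All function names and the example documents are hypothetical.

```python
import random
import zlib


def embed(doc, dim=32):
    """Toy stand-in for step (1): hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in doc.lower().split():
        # crc32 is used instead of hash() so results are stable across runs.
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = sum(x * x for x in vec) ** 0.5 or 1.0
    return [x / norm for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Step (2): plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def filter_clusters(docs, labels, excluded):
    """Step (3): drop every document whose cluster is in `excluded`."""
    return [d for d, lab in zip(docs, labels) if lab not in excluded]


# Hypothetical mini-corpus, for illustration only.
docs = [
    "click here buy now cheap offer",
    "buy cheap offer click now",
    "the transformer model uses attention layers",
    "attention layers in the transformer model",
]
embeddings = [embed(d) for d in docs]
labels = kmeans(embeddings, k=2)

# Assumption: the cluster containing the first document is treated as
# low quality. In practice this decision needs a real quality criterion.
excluded = {labels[0]}
kept = filter_clusters(docs, labels, excluded)
```

In this sketch the filtering decision is made per cluster rather than per document, which is what keeps the approach cheap: only one judgment per cluster is needed, regardless of how many documents each cluster contains.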
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · WordPiece · Weight Decay · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Adafactor
