The MiniPile Challenge for Data-Efficient Language Models

Jean Kaddour

arXiv:2304.08442 · cs.CL · April 18, 2023 · 5 citations

TL;DR

This paper introduces the MiniPile Challenge: pre-training a language model on MiniPile, a curated 6GB subset of The Pile corpus. Models pre-trained on it perform competitively despite seeing significantly less data.

Contribution

It presents a simple data filtering method to create MiniPile, enabling effective language model pre-training on a small, diverse dataset with minimal performance loss.

Findings

01

MiniPile enables pre-training with only 6GB of data.

02

Models pre-trained on MiniPile show only a 1.9%/2.5% performance drop on the GLUE and SNI benchmarks relative to the original pre-trained checkpoints.

03

MiniPile is publicly available for research use.

Abstract

The ever-growing diversity of pre-training text corpora has equipped language models with generalization capabilities across various downstream tasks. However, such diverse datasets are often too large for academic budgets; hence, most research on Transformer architectures, training procedures, optimizers, etc. gets conducted on smaller, homogeneous datasets. To this end, we present The MiniPile Challenge, where one pre-trains a language model on a diverse text corpus containing at most 1M documents. MiniPile is a 6GB subset of the deduplicated 825GB The Pile corpus. To curate MiniPile, we perform a simple, three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using $k$-means, and (3) filter out low-quality clusters. To verify MiniPile's suitability for language model pre-training, we use it to pre-train a BERT and T5…
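The three-step filtering process described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the four documents, their hand-written 2-D "embeddings", the helper names `kmeans` and `filter_clusters`, and the choice of which cluster to drop are all hypothetical stand-ins. A real pipeline would obtain embeddings from a text-embedding model and would need a scalable clustering routine for a corpus the size of The Pile.

```python
import math
from collections import defaultdict


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        members = defaultdict(list)
        for p, label in zip(points, labels):
            members[label].append(p)
        for c, group in members.items():
            centroids[c] = [sum(dim) / len(group) for dim in zip(*group)]
    return labels


def filter_clusters(docs, labels, excluded):
    """Keep only documents whose cluster id is not in the excluded set."""
    return [doc for doc, label in zip(docs, labels) if label not in excluded]


# Step 1 (assumed): embeddings are given. These toy vectors stand in for
# the output of a real embedding model.
docs = [
    "essay on photosynthesis",
    "buy cheap pills now",
    "tutorial on sorting algorithms",
    "click here to win a prize",
]
embeddings = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]

# Step 2: cluster the embedding space.
labels = kmeans(embeddings, k=2)

# Step 3: drop clusters judged low quality. Here the cluster containing a
# known spam document (index 1) is excluded; this rule is illustrative only.
excluded = {labels[1]}
kept = filter_clusters(docs, labels, excluded)
print(kept)
```

The filtering step operates on whole clusters rather than individual documents, so a single judgment per cluster removes every document assigned to it.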

Code & Models

Repositories

kensho-technologies/timtc_vocabs_models
pytorch

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · WordPiece · Weight Decay · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Adafactor