Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Jonathan Hayase; Alisa Liu; Yejin Choi; Sewoong Oh; Noah A. Smith

arXiv:2407.16607·cs.CL·December 3, 2024·1 cites

TL;DR

This paper introduces a method to infer the composition of training data for language models by analyzing BPE tokenizers, revealing insights into the multilingual and domain-specific makeup of popular models.

Contribution

The authors develop a novel linear programming approach to deduce training data proportions from BPE merge rules, enabling analysis of proprietary language model datasets.

Findings

01. GPT-4o's and Mistral NeMo's tokenizers are highly multilingual, trained on 39% and 47% non-English data, respectively.

02. Llama 3's tokenizer extends GPT-3.5's, primarily for multilingual use (48% non-English data).

03. GPT-3.5's and Claude's tokenizers are trained predominantly on code (~60%).

Abstract

The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data. Given a tokenizer's merge list along with example data for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. In controlled experiments, we show that our attack recovers…
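The key insight in the abstract can be illustrated with a toy sketch: at every step, BPE merges the most frequent adjacent pair, so each entry in the merge list implies that its pair was at least as frequent as every other pair under the true mixture. The code below is not the authors' implementation. It assumes only two categories, uses tiny made-up sample strings, and swaps the paper's linear program for a brute-force grid search over the mixture weight. All function names (`pair_counts`, `apply_merge`, `total_violation`, `feasible_interval`) are hypothetical.

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs across a list of symbol sequences."""
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts


def apply_merge(words, pair):
    """Replace each left-to-right occurrence of `pair` with the merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words


def total_violation(alpha, samples, merges):
    """Total slack needed for each observed merge to be the top pair
    when category 0 has weight alpha and category 1 has weight 1 - alpha."""
    weights = (alpha, 1.0 - alpha)
    state = [[list(word) for word in text.split()] for text in samples]
    # Assumption: normalize by each sample's initial pair count.
    sizes = [sum(pair_counts(words).values()) for words in state]
    violation = 0.0
    for merge in merges:
        counts = [pair_counts(words) for words in state]

        def mixed(pair):
            return sum(w * c[pair] / n for w, c, n in zip(weights, counts, sizes))

        candidates = set().union(*counts) - {merge}
        best_other = max((mixed(p) for p in candidates), default=0.0)
        violation += max(0.0, best_other - mixed(merge))
        # Apply the merge to every sample before checking the next one.
        state = [apply_merge(words, merge) for words in state]
    return violation


def feasible_interval(samples, merges, steps=100):
    """Grid-search stand-in for the LP: range of alpha with minimal violation."""
    grid = [i / steps for i in range(steps + 1)]
    scores = [total_violation(a, samples, merges) for a in grid]
    best = min(scores)
    ok = [a for a, s in zip(grid, scores) if s <= best + 1e-12]
    return min(ok), max(ok)


# Toy samples (made up): category 0 is "English", category 1 is "code".
samples = ["the the the then", "== == != =="]
# Hypothetical merge list from a tokenizer trained on an unknown mixture.
merges = [("=", "="), ("t", "h")]
low, high = feasible_interval(samples, merges)
```

On this toy input the compatible range for the "English" weight comes out at roughly 0.36 to 0.62. Two merges only bound the weight loosely; the premise of the attack is that a real merge list is long enough for such constraints to pin the proportions down much more tightly.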

Code & Models

Repositories

alisawuffles/tokenizer-attack (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: LLaMA · Cosine Annealing · Linear Warmup With Cosine Annealing · Residual Connection · Dropout · Adam · Byte Pair Encoding · Layer Normalization · Linear Layer