TL;DR
This paper introduces a method to identify Chinese tokens in LLM vocabularies that indicate potentially polluted training data, and analyzes their prevalence and implications in various models including GPT-4o.
Contribution
It provides a formal definition of polluted Chinese tokens, develops a detection method, and studies their correlation with training data pollution across multiple LLMs.
Findings
Over 23% of long Chinese tokens in GPT's vocabulary are related to pornography or gambling.
The detection method accurately estimates pollution levels in datasets like C4 and Pile.
Around 0.5% of the webpages in GPT-4o's training data are estimated to relate to specific content.
Abstract
Tokens are the basic elements of the datasets used to train LLMs. It is well known that many tokens representing Chinese phrases in the vocabulary of GPT (4o/4o-mini/o1/o3/4.5/4.1/o4-mini) indicate content such as pornography or online gambling. Based on this observation, our goal is to locate Polluted Chinese (PoC) tokens in LLMs and to study the relationship between the existence of PoC tokens and the training data. (1) We give a formal definition and taxonomy of PoC tokens based on GPT's vocabulary. (2) We build a PoC token detector by fine-tuning an LLM to label PoC tokens in vocabularies, considering both each token's semantics and related content retrieved from search engines. (3) We study how to speculate about training data pollution from the appearance of PoC tokens (token ID). Experiments on GPT and 23 other LLMs indicate that such tokens are widespread, with GPT's vocabulary faring the worst: more…
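To make the headline statistic concrete, here is a minimal sketch of how a share of flagged "long Chinese tokens" could be computed over a vocabulary. It is not the authors' detector. The names `is_long_chinese` and `polluted_share`, the length threshold, the restriction to the basic CJK Unified Ideographs range, and the toy labeler are all assumptions for illustration; the paper's fine-tuned LLM labeler would take the place of the `is_polluted` callback.

```python
import re

# Assumption: a "long" Chinese token has at least this many CJK characters.
# The threshold is illustrative and not taken from the text above.
MIN_CHARS = 3

# Assumption: only the basic CJK Unified Ideographs block counts as "Chinese".
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def is_long_chinese(token: str, min_chars: int = MIN_CHARS) -> bool:
    """Return True if the token, ignoring surrounding whitespace, consists
    only of CJK ideographs and has at least `min_chars` of them."""
    stripped = token.strip()
    return len(stripped) >= min_chars and all(
        CJK_CHAR.fullmatch(ch) for ch in stripped
    )


def polluted_share(vocab, is_polluted) -> float:
    """Fraction of long Chinese tokens in `vocab` flagged by `is_polluted`.

    `is_polluted` is a hypothetical callback standing in for a real labeler.
    """
    long_tokens = [t for t in vocab if is_long_chinese(t)]
    if not long_tokens:
        return 0.0
    flagged = sum(1 for t in long_tokens if is_polluted(t))
    return flagged / len(long_tokens)


# Toy vocabulary and an arbitrary toy label set; neither reflects real
# tokenizer contents or real pollution labels.
toy_vocab = ["hello", " 中国", "人民共和国", "不客气", "的"]
toy_labels = {"不客气"}  # arbitrary placeholder, not a claim about this phrase
share = polluted_share(toy_vocab, lambda t: t.strip() in toy_labels)
```

With a real tokenizer, `toy_vocab` would be replaced by the decoded vocabulary strings, and the callback by whatever classifier is being evaluated.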