No News is Good News: A Critique of the One Billion Word Benchmark
Helen Ngo, João G.M. Araújo, Jeffrey Hui, Nicholas Frosst

TL;DR
This paper critiques the One Billion Word Benchmark, highlighting its limitations due to temporal distribution shifts and harmful content, and suggests it may not be ideal for evaluating language models.
Contribution
The paper provides an analysis of the dataset's temporal and content issues, proposing that it is unsuitable for consistent language model evaluation.
Findings
Models trained only on Common Crawl data from more recent years perform progressively worse on the benchmark.
The dataset contains harmful and outdated content.
Distributional shift between the 2011 news text and later data undermines the benchmark's reliability as a measure of language modeling ability.
Abstract
The One Billion Word Benchmark is a dataset derived from the WMT 2011 News Crawl, commonly used to measure language modeling ability in natural language processing. We train models solely on Common Crawl web scrapes partitioned by year, and demonstrate that they perform worse on this task over time due to distributional shift. Analysis of this corpus reveals that it contains several examples of harmful text, as well as outdated references to current events. We suggest that the temporal nature of news and its distribution shift over time makes it poorly suited for measuring language modeling ability, and discuss potential impact and considerations for researchers building language models and evaluation datasets.
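The evaluation idea in the abstract (train one model per yearly data partition, then score each on the same fixed test set) can be illustrated with a minimal sketch. This is not the authors' setup: an add-one-smoothed unigram model stands in for a real language model, and the corpora, the evaluation sentence, and all names (`corpora_by_year`, `train_unigram`, `perplexity`) are made up for illustration.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Return (token counts, total token count) for a unigram model."""
    counts = Counter(tokens)
    return counts, sum(counts.values())

def perplexity(model, eval_tokens, vocab_size):
    """Perplexity of eval_tokens under an add-one-smoothed unigram model."""
    counts, total = model
    log_prob = 0.0
    for tok in eval_tokens:
        # Add-one smoothing gives unseen tokens a small nonzero probability.
        p = (counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(eval_tokens))

# Toy, invented stand-ins for yearly training partitions (assumption).
corpora_by_year = {
    2011: "the president spoke about the economy and the election".split(),
    2020: "the stream went viral and the app added a new filter".split(),
}

# Toy, invented stand-in for a fixed news-style evaluation set (assumption).
eval_tokens = "the president spoke about the election".split()

# Shared vocabulary so perplexities are comparable across models.
vocab = set(eval_tokens)
for toks in corpora_by_year.values():
    vocab.update(toks)
vocab_size = len(vocab)

ppl_by_year = {
    year: perplexity(train_unigram(toks), eval_tokens, vocab_size)
    for year, toks in corpora_by_year.items()
}

for year in sorted(ppl_by_year):
    print(year, round(ppl_by_year[year], 2))
```

In this toy setting the model trained on text resembling the evaluation set scores lower perplexity than the model trained on differently distributed text, which mirrors the kind of gap the abstract attributes to distributional shift.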
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
