Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus
Jack Bandy, Nicholas Vincent

TL;DR
This paper creates a preliminary datasheet for the BookCorpus dataset to address documentation gaps, revealing copyright issues, duplicates, genre skews, and other deficiencies to improve transparency in ML dataset usage.
Contribution
It provides the first systematic datasheet for BookCorpus, highlighting key deficiencies and advocating for better dataset documentation in machine learning research.
Findings
BookCorpus likely violates copyright restrictions for many books.
BookCorpus contains thousands of duplicated books.
BookCorpus exhibits significant genre skew.
Abstract
Recent literature has underscored the importance of dataset documentation work for machine learning, and part of this work involves addressing "documentation debt" for datasets that have been used widely but documented sparsely. This paper aims to help address documentation debt for BookCorpus, a popular text dataset for training large language models. Notably, researchers have used BookCorpus to train OpenAI's GPT-N models and Google's BERT models, even though little to no documentation exists about the dataset's motivation, composition, collection process, etc. We offer a preliminary datasheet that provides key context and information about BookCorpus, highlighting several notable deficiencies. In particular, we find evidence that (1) BookCorpus likely violates copyright restrictions for many books, (2) BookCorpus contains thousands of duplicated books, and (3) BookCorpus exhibits…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Data Quality and Management
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Linear Warmup With Linear Decay · Multi-Head Attention · Residual Connection · WordPiece · Dense Connections
