Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Peter Henderson, Mark S. Krass, Lucia Zheng, Neel Guha, Christopher D. Manning, Dan Jurafsky, Daniel E. Ho

TL;DR
This paper introduces the Pile of Law, a large open-source legal dataset, and proposes a law-based filtering approach to improve responsible data curation for large language models, aiming to reduce harmful content.
Contribution
It presents a new legal dataset and a law-grounded filtering methodology, enabling models to learn content restrictions directly from data, addressing ethical concerns in pretraining.
Findings
The Pile of Law dataset contains 256GB of legal and administrative data.
Legal norms can be distilled into actionable filtering lessons.
Researchers can learn filtering rules directly from the dataset.
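As a rough illustration of what learning a filtering rule from data could look like, the sketch below estimates how often toy "opinions" in different contexts refer to a party by a pseudonym, then turns that rate into a per-context redaction default. The documents, context labels, pseudonym pattern, function names, and the 0.5 threshold are all invented for illustration; this is not the authors' method, and none of the text comes from the Pile of Law.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus of (context label, text) pairs; not real data.
TOY_DOCS = [
    ("asylum", "Jane Doe sought relief from removal."),
    ("asylum", "John Doe testified about persecution."),
    ("asylum", "Maria Lopez appealed the denial."),
    ("contract", "Acme Corp. sued Robert Smith for breach."),
    ("contract", "Linda Chan breached the lease."),
]

# Assumed pseudonym pattern; real opinions use many more conventions.
PSEUDONYM = re.compile(r"\b(?:Jane|John) (?:Doe|Roe)\b")


def pseudonym_rates(docs):
    """Return, per context, the fraction of documents that use a pseudonym."""
    hits, totals = defaultdict(int), defaultdict(int)
    for context, text in docs:
        totals[context] += 1
        if PSEUDONYM.search(text):
            hits[context] += 1
    return {context: hits[context] / totals[context] for context in totals}


def learn_rule(docs, threshold=0.5):
    """Map each context to True when names would be redacted by default."""
    return {
        context: rate >= threshold
        for context, rate in pseudonym_rates(docs).items()
    }


rates = pseudonym_rates(TOY_DOCS)
rule = learn_rule(TOY_DOCS)
```

On this toy corpus the rule marks the "asylum" context for redaction and leaves "contract" alone, because pseudonyms appear in most of the former and none of the latter. A realistic version would need a trained classifier and far richer context than a single label.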
Abstract
One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a 256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. Pretraining on the Pile of Law may help with legal tasks that have the promise to improve access to justice. Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic…
Code & Models
- 🤗 BSC-LT/salamandra-7b-instruct (model) · 81k dl · ♡ 77
- 🤗 pile-of-law/legalbert-large-1.7M-1 (model) · 6 dl · ♡ 15
- 🤗 pile-of-law/legalbert-large-1.7M-2 (model) · 211 dl · ♡ 72
- 🤗 pile-of-law/distilbert-base-uncased-finetuned-eoir_privacy (model) · 3 dl · ♡ 5
- 🤗 thomsonreuters/budgetlongformer-diverse (model) · 1 dl · ♡ 10
- 🤗 BSC-LT/salamandra-7b (model) · 355 dl · ♡ 29
- 🤗 BSC-LT/salamandra-2b (model) · 1.3k dl · ♡ 25
- 🤗 BSC-LT/salamandra-2b-instruct (model) · 6.3k dl · ♡ 27
- 🤗 robbiemu/salamandra-2b-instruct (model) · 92 dl
- 🤗 RichardErkhov/BSC-LT_-_salamandra-7b-instruct-gguf (model) · 141 dl
Taxonomy
Topics: Artificial Intelligence in Law
Methods: High-Order Consensuses
