Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

Peter Henderson; Mark S. Krass; Lucia Zheng; Neel Guha; Christopher D. Manning; Dan Jurafsky; Daniel E. Ho

arXiv:2207.00220 · cs.CL · November 30, 2022 · 44 cites

TL;DR

This paper introduces the Pile of Law, a large open-source legal dataset, and proposes a law-based filtering approach to improve responsible data curation for large language models, aiming to reduce harmful content.

Contribution

It presents a new legal dataset and a law-grounded filtering methodology, enabling models to learn content restrictions directly from data, addressing ethical concerns in pretraining.

Findings

01. The Pile of Law dataset contains 256GB of legal and administrative data.

02. Legal norms can be distilled into actionable filtering lessons.

03. Researchers can learn filtering rules directly from the dataset.

Abstract

One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a 256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. Pretraining on the Pile of Law may help with legal tasks that have the promise to improve access to justice. Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic…

Code & Models

Repositories

breakend/pileoflaw (official)

Videos

Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset (SlidesLive)

Taxonomy

Topics: Artificial Intelligence in Law

Methods: High-Order Consensuses