Building Legal Datasets

Jerrold Soh

arXiv:2111.02034·cs.LG·November 4, 2021

Open Access

TL;DR

This paper argues that legal compliance is an essential part of building datasets for data-centric AI. It reviews the key legal obligations surrounding ML datasets, examines how data laws affect ML pipelines, and proposes a framework for building lawful datasets.

Contribution

The paper reviews the key legal obligations surrounding ML datasets and introduces a practical framework for constructing legally compliant datasets.

Findings

01. Legal obligations significantly influence dataset construction

02. Data laws impact ML pipeline design and data sharing

03. A framework aids in building legally compliant datasets

Abstract

Data-centric AI calls for better, not just bigger, datasets. As data protection laws with extra-territorial reach proliferate worldwide, ensuring datasets are legal is an increasingly crucial yet overlooked component of "better". To help dataset builders become more willing and able to navigate this complex legal space, this paper reviews key legal obligations surrounding ML datasets, examines the practical impact of data laws on ML pipelines, and offers a framework for building legal datasets.

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Ethics and Social Impacts of AI · Digital and Cyber Forensics