SoK: Data Minimization in Machine Learning

Robin Staab; Nikola Jovanović; Kimberly Mai; Prakhar Ganesh; Martin Vechev; Ferdinando Fioretto; Matthew Jagielski

arXiv:2508.10836 · cs.LG · February 19, 2026

TL;DR

This paper surveys data minimization in machine learning, emphasizes its importance for privacy and regulatory compliance, and introduces a unified framework to guide research and practice in this area.

Contribution

It presents the first systematic analysis of data minimization in ML, offering a unified framework and clarifying terminology, metrics, and trade-offs for practitioners and researchers.

Findings

01. Introduces a general framework for data minimization in ML (DMML), including a data pipeline and adversarial models.

02. Systematically reviews the existing DMML literature and related methodologies.

03. Helps practitioners identify relevant techniques and understand the assumptions behind DMML.

Abstract

Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the…
