Learning to Limit Data Collection via Scaling Laws: A Computational Interpretation for the Legal Principle of Data Minimization
Divya Shanmugam, Samira Shabanian, Fernando Diaz, Michèle Finck, Asia Biega

TL;DR
This paper introduces FIDO, a framework that interprets data minimization in machine learning by estimating performance curves to determine optimal data collection limits, aligning with GDPR principles.
Contribution
FIDO provides a novel, performance-based data collection stopping criterion using a piecewise power law model, bridging legal principles and technical data minimization methods.
Findings
FIDO accurately estimates performance curves across datasets.
The framework effectively determines when to stop data collection.
Many curve families overestimate the value of additional data.
Abstract
Modern machine learning systems are increasingly characterized by extensive personal data collection, despite the diminishing returns and increasing societal costs of such practices. Yet, data minimisation is one of the core data protection principles enshrined in the European Union's General Data Protection Regulation ('GDPR') and requires that only personal data that is adequate, relevant and limited to what is necessary is processed. However, the principle has seen limited adoption due to the lack of technical interpretation. In this work, we build on literature in machine learning and law to propose FIDO, a Framework for Inhibiting Data Overcollection. FIDO learns to limit data collection based on an interpretation of data minimization tied to system performance. Concretely, FIDO provides a data collection stopping criterion by iteratively updating an estimate of the performance…
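To make the idea of a performance-based stopping criterion concrete, the following is a minimal sketch, not FIDO itself. It assumes a single power law, error(n) ≈ a · n^(−b), rather than the piecewise power law the paper uses, and it fits the curve by least squares in log-log space. The function names (`fit_power_law`, `predicted_error`, `should_stop`), the parameters `extra` and `min_gain`, and the synthetic data are all hypothetical and chosen for illustration only.

```python
import math


def fit_power_law(sizes, errors):
    """Fit errors ~ a * sizes**(-b) by ordinary least squares in log-log space.

    Assumption: a single power law; the paper's model is piecewise.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predicted_error(n, a, b):
    """Error predicted by the fitted curve at dataset size n."""
    return a * n ** (-b)


def should_stop(sizes, errors, extra, min_gain):
    """Hypothetical stopping rule: stop collecting when the predicted error
    reduction from `extra` additional samples falls below `min_gain`."""
    a, b = fit_power_law(sizes, errors)
    n = sizes[-1]
    gain = predicted_error(n, a, b) - predicted_error(n + extra, a, b)
    return gain < min_gain


# Synthetic observations generated from error(n) = 2 * n**-0.5 (illustrative only).
sizes = [100, 200, 400, 800, 1600]
errors = [2.0 * n ** -0.5 for n in sizes]

a, b = fit_power_law(sizes, errors)
stop_strict = should_stop(sizes, errors, extra=1600, min_gain=0.001)
stop_loose = should_stop(sizes, errors, extra=1600, min_gain=0.05)
```

In an iterative setting, one would refit the curve after each new batch of data and call `should_stop` again. The threshold `min_gain` encodes how much improvement is considered enough to justify further collection, which is a judgment this sketch leaves to the user.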
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Age of Information Optimization · Data Quality and Management
