Aggregating Data for Optimal and Private Learning

Sushant Agarwal; Yukti Makhija; Rishi Saket; Aravindan Raghuveer

arXiv:2411.19045·cs.LG·December 2, 2024

TL;DR

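This paper studies how to partition a dataset into bags under the Multiple Instance Regression (MIR) and Learning from Label Proportions (LLP) frameworks so that utility for downstream tasks such as linear regression is maximized. It provides theoretical utility guarantees and shows that the bagging mechanisms can be made label-differentially private.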

Contribution

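The paper shows that near-optimal bagging strategies can be found by clustering the feature vectors or the labels, gives utility guarantees for the resulting bags, adds label-differential privacy to the bagging mechanisms, and applies the approach to generalized linear models (GLMs).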

Findings

01 Optimal bagging reduces (approximately) to clustering the feature vectors or the labels.

02 Theoretical utility guarantees are provided for different loss functions.

03 The bagging mechanisms can be made label-differentially private.

Abstract

Multiple Instance Regression (MIR) and Learning from Label Proportions (LLP) are learning frameworks arising in many applications, where the training data is partitioned into disjoint sets, or bags, and only an aggregate label, i.e., a bag-label, is available to the learner for each bag. In MIR, the bag-label is the label of an undisclosed instance from the bag, while in LLP, the bag-label is the mean of the bag's labels. In this paper, we study, for various loss functions in MIR and LLP, the optimal way to partition the dataset into bags such that the utility for downstream tasks like linear regression is maximized. We provide theoretical utility guarantees, and show that in each case, the optimal bagging strategy (approximately) reduces to finding an optimal clustering of the feature vectors or the labels with respect to natural objectives such as $k$-means. We also…
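To make the bag-label definitions above concrete, the following is a minimal sketch, not the authors' algorithm. It partitions a toy dataset into bags with a plain Lloyd's $k$-means on the feature vectors, then computes the LLP bag-label (the mean of the bag's labels) and the MIR bag-label (the label of one instance from the bag). The function names, the farthest-first initialization, the uniform random choice of the MIR instance, and the toy data are all assumptions made for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_bags(points, k, iters=20):
    """Hypothetical helper: Lloyd's k-means, returning bags as lists of indices."""
    # Farthest-first initialization (an assumption, chosen so the sketch is deterministic).
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    bags = [[i for i, a in enumerate(assign) if a == c] for c in range(k)]
    return [b for b in bags if b]  # drop empty clusters so bags stay non-empty

def llp_bag_labels(bags, labels):
    """LLP: the bag-label is the mean of the labels in the bag."""
    return [sum(labels[i] for i in bag) / len(bag) for bag in bags]

def mir_bag_labels(bags, labels, seed=0):
    """MIR: the bag-label is the label of one undisclosed instance from the bag.
    Choosing that instance uniformly at random is an assumption of this sketch."""
    rng = random.Random(seed)
    return [labels[rng.choice(bag)] for bag in bags]

# Toy data (made up for illustration): two well-separated groups of feature vectors.
X = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
y = [1.0, 3.0, 10.0, 12.0]

bags = kmeans_bags(X, k=2)
llp = llp_bag_labels(bags, y)
mir = mir_bag_labels(bags, y)
```

In both settings the learner would see only the feature vectors, the bag memberships, and the bag-labels `llp` or `mir`, never the individual labels `y`. Clustering the labels instead of the feature vectors would amount to passing one-dimensional points such as `[(v,) for v in y]` to the same helper.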

Taxonomy

Topics: Cryptography and Data Security

Methods: Linear Regression