MYCROFT: Towards Effective and Efficient External Data Augmentation

Zain Sarwar; Van Tran; Arjun Nitin Bhagoji; Nick Feamster; Ben Y. Zhao; Supriyo Chakraborty

arXiv:2410.08432·cs.LG·October 14, 2024

Open Access · 3 Reviews

TL;DR

Mycroft is a data-efficient method that helps ML practitioners select valuable external data sources with minimal data sharing, improving model performance rapidly and securely across various tasks.

Contribution

It introduces a novel approach combining feature space distances and gradient matching to identify informative data subsets from private sources.

Findings

01 Converges rapidly to full-data performance

02 Robust to noise in data sources

03 Effectively ranks data owners by utility

Abstract

Machine learning (ML) models often require large amounts of data to perform well. When the available data is limited, model trainers may need to acquire more data from external sources. Often, useful data is held by private entities who are hesitant to share their data due to proprietary and privacy concerns. This makes it challenging and expensive for model trainers to acquire the data they need to improve model performance. To address this challenge, we propose Mycroft, a data-efficient method that enables model trainers to evaluate the relative utility of different data sources while working with a constrained data-sharing budget. By leveraging feature space distances and gradient matching, Mycroft identifies small but informative data subsets from each owner, allowing model trainers to maximize performance with minimal data exposure. Experimental results across four tasks in two…
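The abstract's idea of combining feature space distances with gradient matching can be illustrated with a rough sketch. This is not the paper's algorithm: the function name `select_subset`, the weight `lam`, the cosine-similarity proxy for gradient matching, and the nearest-neighbor feature score are all assumptions made for illustration. It scores each of a data owner's samples against the trainer's hard examples and keeps the top samples within a sharing budget.

```python
import numpy as np


def select_subset(owner_feats, owner_grads, hard_feats, hard_grads, budget, lam=0.5):
    """Return indices of the `budget` owner samples that best match the
    trainer's hard examples. Hypothetical scoring rule, not the paper's."""
    # Gradient-matching proxy (assumption): cosine similarity between each
    # owner gradient and the mean gradient of the hard examples.
    target = hard_grads.mean(axis=0)
    grad_sim = owner_grads @ target / (
        np.linalg.norm(owner_grads, axis=1) * np.linalg.norm(target) + 1e-12
    )
    # Feature-space proximity (assumption): negative Euclidean distance
    # to the nearest hard example.
    dists = np.linalg.norm(
        owner_feats[:, None, :] - hard_feats[None, :, :], axis=2
    )
    feat_sim = -dists.min(axis=1)
    # Weighted combination; `lam` trades off the two signals.
    score = lam * grad_sim + (1.0 - lam) * feat_sim
    # Highest-scoring samples first, truncated to the sharing budget.
    return np.argsort(-score)[:budget]
```

Under the same assumptions, one way to compare several data owners would be to run `select_subset` per owner and rank owners by the performance gained from their selected subsets.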

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. Clear Motivation: The paper addresses a real and relevant problem in machine learning, especially in settings where data sharing is constrained by privacy or budget.
2. Methodological Rigor: Mycroft leverages both gradient-based functional similarity and feature similarity to enhance data selection. The approach is well-supported by mathematical modeling and empirical evidence.
3. Comprehensive Experiments: The authors conduct thorough experiments across multiple domains (e.g., vision, tabular d

Weaknesses

1. Limited Novelty in Data Selection Techniques: The core techniques (functional similarity via gradient matching and feature similarity) are adaptations of existing methods, which limits the novelty. While combining them is beneficial, this integration may not be sufficient to stand out as a fundamentally new approach.
2. Scalability Concerns: Although the authors provide complexity analyses, the scalability of Mycroft, especially with high-dimensional datasets, remains a concern. For larger data

Reviewer 02 · Rating 3 · Confidence 3

Strengths

- The problem of collaboration between model trainers and data owners is relevant in many contexts.
- The problem formulation, Equation (1), seems novel.
- The proposed approach, Mycroft, seems to solve the subset selection problem well (Section 4).
- Additionally, Mycroft seems to have additional regularization properties (Section 5).
- An observation that an earlier checkpoint is better for gradient matching (Section 4.3) is interesting.

Weaknesses

- RetrieveTopK function in Algorithm 3 is confusing. The textual explanation suggests that this function is a sorting "greedy heuristic." However, it is hard to understand how this function is "ensuring coverage of $D^{hard}$."
- Algorithms 4 and 5 are hard to understand without textual description.
- The inequality below Equation (5) in the proof of Theorem 3.1 does not match the definition of $e_\lambda$ (Equation (1)). From what I see, the proof suggests that the first gradient-matching term

Reviewer 03 · Rating 3 · Confidence 3

Strengths

- This paper is easy to follow.

Weaknesses

- The motivation of this paper is not clear: is it privacy protection, or something else? It also lacks real scenarios, and it reads to me like dataset condensation or distillation.
- One of the challenges in 2.1 is the reluctance to share data due to privacy, so why can the Mycroft framework assume that owners are willing to share a small amount of data? Privacy concerns should still apply there. These challenges are also related to FL.
- The problem this paper investigates does not seem to be very real

Taxonomy

Topics: Advanced Data Compression Techniques · Advanced Data Storage Technologies · Neural Networks and Applications