Generalized Group Data Attribution

Dan Ley; Suraj Srinivas; Shichang Zhang; Gili Rusak; Himabindu Lakkaraju

arXiv:2410.09940·cs.LG·October 22, 2024

Open Access 3 Reviews

TL;DR

The paper introduces GGDA, a framework that simplifies data attribution by grouping data points, significantly improving computational efficiency while maintaining effectiveness for large-scale machine learning applications.

Contribution

GGDA is a novel, general framework that reduces the computational cost of data attribution methods by grouping data points, enabling scalable and practical applications.

Findings

01. GGDA achieves 10x-50x speedups over standard DA methods.

02. GGDA maintains effectiveness in dataset pruning and noisy label detection.

03. GGDA enables scalable data attribution for large models.

Abstract

Data Attribution (DA) methods quantify the influence of individual training data points on model outputs and have broad applications such as explainability, data selection, and noisy label identification. However, existing DA methods are often computationally intensive, limiting their applicability to large-scale machine learning models. To address this challenge, we introduce the Generalized Group Data Attribution (GGDA) framework, which computationally simplifies DA by attributing to groups of training points instead of individual ones. GGDA is a general framework that subsumes existing attribution methods and can be applied to new DA techniques as they emerge. It allows users to optimize the trade-off between efficiency and fidelity based on their needs. Our empirical results demonstrate that GGDA applied to popular DA methods such as Influence Functions, TracIn, and TRAK results in…
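The idea of attributing to groups rather than individual points can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a logistic-regression model and a first-order gradient dot-product score (in the spirit of TracIn at a single checkpoint), which may differ from the exact estimators in the paper. All function names (`per_sample_grads`, `individual_attributions`, `group_attributions`) and the toy data are hypothetical.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Logistic-loss gradient w.r.t. w for each row of X; returns shape (n, d)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (p - y)[:, None] * X

def individual_attributions(w, X_train, y_train, x_test, y_test):
    """One score per training point: <grad_train_i, grad_test> (assumed scoring rule)."""
    g_train = per_sample_grads(w, X_train, y_train)
    g_test = per_sample_grads(w, x_test[None, :], np.array([y_test]))[0]
    return g_train @ g_test

def group_attributions(w, X_train, y_train, groups, x_test, y_test):
    """One score per group: <sum of gradients in the group, grad_test>."""
    n_groups = int(groups.max()) + 1
    g_train = per_sample_grads(w, X_train, y_train)
    g_group = np.zeros((n_groups, X_train.shape[1]))
    # Accumulate each point's gradient into its group's row.
    np.add.at(g_group, groups, g_train)
    g_test = per_sample_grads(w, x_test[None, :], np.array([y_test]))[0]
    return g_group @ g_test

# Toy data (hypothetical): 12 training points, 4 features, 3 groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
y = rng.integers(0, 2, size=12).astype(float)
w = rng.normal(size=4)
groups = np.arange(12) % 3
x_test, y_test = rng.normal(size=4), 1.0

ind = individual_attributions(w, X, y, x_test, y_test)
grp = group_attributions(w, X, y, groups, x_test, y_test)
```

For this linear scoring rule, each group score equals the sum of its members' individual scores. In this sketch the group gradient is built from per-sample gradients for clarity; any saving would come from computing one gradient of the summed group loss instead, so the number of gradient evaluations scales with the number of groups rather than the number of points.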

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

The paper is clearly written and addresses an important problem, namely the resource-intensive nature of many data attribution methods. The proposed solution is clearly explained, and the writing is clear and concise.

Weaknesses

In my opinion, the main weakness of this paper is the novelty and depth of the investigation. As far as I can tell, the paper proposes turning a point-to-point data attribution method into a group-to-group data attribution method by effectively summing the corresponding individual attributions. This does not seem so fundamental a contribution---e.g., the fact that this reduces sample complexity from O(# points) to O(# groups) seems to follow directly by construction, as without loss of generalit

Reviewer 02 · Rating 3 · Confidence 4

Strengths

The paper is well-written, clearly defining introduced concepts, and is well-motivated, as improving computational efficiency in data attribution is valuable for large-scale machine learning. The authors investigate a generally applicable approach to enhance the computational efficiency of data attribution methods, as claimed. They use a variety of data modalities (tabular, image, text) to validate their approach in downstream supervised learning tasks.

Weaknesses

1. The experimental datasets (e.g., MNIST, CIFAR-10) are relatively small, calling into question GGDA’s scalability claims for large-scale ML. Can the method be tested on a larger dataset like ImageNet? Does it maintain an effective compute-fidelity tradeoff as sample size increases?
2. In Section 4, line 272, the authors claim computational advantages for group data attribution. However, in line 265, they note that “a single batched gradient computation is roughly equivalent in runt

Reviewer 03 · Rating 5 · Confidence 3

Strengths

The paper's experimental section introduces K-Means clustering in gradient space as part of the grouping strategy. This innovative design improves attribution accuracy. The approach demonstrates significant advantages in different attribution tasks, such as dataset pruning and noisy label detection, validating its applicability across various scenarios.
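As a rough illustration of the grouping strategy mentioned above, the sketch below clusters per-sample gradient vectors with a plain NumPy implementation of Lloyd's K-Means and uses the cluster labels as group assignments. Which gradients the paper clusters (for example, full or last-layer) is not stated in this summary, so the logistic-regression gradients, the helper name `kmeans_groups`, and the toy data are all assumptions.

```python
import numpy as np

def kmeans_groups(features, k, n_iter=50, seed=0):
    """Lloyd's K-Means; returns one integer label in [0, k) per row of features."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points.
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy setup (hypothetical): logistic-regression per-sample gradients as features.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = rng.integers(0, 2, size=40).astype(float)
w = rng.normal(size=5)
p = 1.0 / (1.0 + np.exp(-(X @ w)))
grads = (p - y)[:, None] * X          # shape (40, 5): one gradient per training point

groups = kmeans_groups(grads, k=4)    # group id for each training point
```

The resulting `groups` array can then be passed to whatever group-level attribution routine is in use; points with similar gradients end up in the same group, which is the intuition behind clustering in gradient space.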

Weaknesses

1. **Absence of Large-Scale Dataset Experiments**: The experiments primarily focus on small to medium-scale datasets, leaving out truly large-scale datasets (e.g., billion-level data). To better demonstrate GGDA’s scalability, future work should incorporate experiments on large-scale datasets and report both computational efficiency and attribution performance in such scenarios.
2. **Lack of K-Means Analysis**: K-Means plays a vital role in the proposed method's effectiveness, but the paper lack

Taxonomy

Topics: Distributed Sensor Networks and Detection Algorithms · Cognitive Computing and Networks · Access Control and Trust

Methods: Dataset Pruning · Pruning