Efficient Ensembles Improve Training Data Attribution

Junwei Deng; Ting-Wei Li; Shichang Zhang; Jiaqi Ma

arXiv:2405.17293·cs.LG·May 28, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces two efficient ensemble strategies, DROPOUT ENSEMBLE and LORA ENSEMBLE, that significantly reduce training and serving costs in training data attribution methods while maintaining high attribution accuracy.

Contribution

It proposes novel ensemble strategies that eliminate the need for fully independent training, improving efficiency in training data attribution without sacrificing effectiveness.

Findings

01 Reduce training time by up to 80%

02 Cut serving time by up to 60%

03 Maintain similar attribution efficacy to naive ensembles

Abstract

Training data attribution (TDA) methods aim to quantify the influence of individual training data points on the model predictions, with broad applications in data-centric AI, such as mislabel detection, data selection, and copyright compensation. However, existing methods in this field, which can be categorized as retraining-based and gradient-based, have struggled with the trade-off between computational efficiency and attribution efficacy. Retraining-based methods can accurately attribute complex non-convex models but are computationally prohibitive, while gradient-based methods are efficient but often fail for non-convex models. Recent research has shown that augmenting gradient-based methods with ensembles of multiple independently trained models can achieve significantly better attribution efficacy. However, this approach remains impractical for very large-scale applications. In…
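To make the ensembling idea concrete, here is a minimal sketch of averaging a gradient-based attribution score over several dropout masks applied to a single trained model, instead of over independently trained models. It is a toy illustration under stated assumptions, not the paper's implementation: the model is a logistic regression, dropout is applied to the input features, the per-mask score is a plain gradient dot product, and the names `per_example_grads` and `dropout_ensemble_scores` are hypothetical.

```python
import numpy as np


def per_example_grads(w, X, y, mask):
    """Per-example gradient of the logistic loss w.r.t. w, with a dropout mask on the inputs."""
    Xm = X * mask                                # same mask for every row
    p = 1.0 / (1.0 + np.exp(-(Xm @ w)))          # predicted probabilities
    return (p - y)[:, None] * Xm                 # shape: (n_examples, n_features)


def dropout_ensemble_scores(w, X_train, y_train, X_test, y_test,
                            n_masks=8, p_drop=0.2, seed=0):
    """Average gradient dot-product scores over n_masks random dropout masks.

    Assumption: each mask plays the role of one ensemble member; the weights w
    come from a single training run and are never retrained.
    """
    rng = np.random.default_rng(seed)
    n_features = X_train.shape[1]
    scores = np.zeros((X_test.shape[0], X_train.shape[0]))
    for _ in range(n_masks):
        keep = rng.random(n_features) >= p_drop
        mask = keep / (1.0 - p_drop)             # inverted-dropout scaling
        g_train = per_example_grads(w, X_train, y_train, mask)
        g_test = per_example_grads(w, X_test, y_test, mask)
        scores += g_test @ g_train.T             # (n_test, n_train) similarity
    return scores / n_masks
```

Entry `[i, j]` of the returned matrix is the averaged score of training point `j` for test point `i`. With `p_drop=0.0` the mask is all ones and the function reduces to a single gradient dot-product attribution.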

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

The authors motivate their problem well and make the case for better ensembling methods based on simple techniques, as opposed to naive ensembling, which may involve fully retraining or fine-tuning models. They build on the TRAK approach, which proposed ensembling after applying random projections to gradients to reduce the computational expense of determining attribution scores. The authors also differentiate between training, serving, and space costs. They also use so

Weaknesses

These are some comments that the authors may consider: 1. While the methods do improve performance, they do not help scale TRAK to larger models, since the TRAK approach has a huge storage cost. This also seems to lead the authors to test with smallish models. 2. It’s not clear how the LORA approach compares with the dropout approach. Is there any way to compare them with each other? Also, LORA has only been demonstrated for the Music Transformer model. 3. If the authors really want to demonstrat

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. This paper is well-organized, with a clear and logical structure that makes it easy to follow and understand. 2. The proposed strategies are technically sound and could significantly reduce the training time and space costs compared to the naive independent ensemble method.

Weaknesses

1. The authors proposed two strategies, DROPOUT ENSEMBLE and LORA ENSEMBLE. However, it is unclear which might perform better and their respective advantages in different scenarios. 2. Given that the computational time of TracIn [1] is also quite low, the comparative advantages of DROPOUT ENSEMBLE over TracIn as well as the reasons are not well stated. 3. The idea of LORA ENSEMBLE appears similar to GEX [2]. It could be helpful to include GEX in the related work section and discuss the

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The method is straightforward to understand and implement since it relies primarily on the use of widely available modules (e.g. dropout). The method is flexible and can be applied across a wide variety of settings. The authors demonstrate that their method can perform attribution at a similar level to the naive independent ensemble method with far less training.

Weaknesses

If I understand the method correctly, it should generally be expected that, while training time significantly improves using this method, serving time would be more expensive. If this is the case, then it should be made clearer in the paper, ideally with some quantification of this cost. While the authors perform experiments across a variety of methods, models, tasks, and metrics, the full grid of possible evaluations is not complete. For example, a comparison between the LoRA-based and dropout

Taxonomy

Topics: Machine Learning and Data Classification

Methods: Dropout