Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification?

Ruoxi Jia; Fan Wu; Xuehui Sun; Jiacen Xu; David Dao; Bhavya Kailkhura; Ce Zhang; Bo Li; Dawn Song

arXiv:1911.07128·cs.LG·April 27, 2021

TL;DR

This paper compares various data importance quantification methods, especially Shapley value approximations, analyzing their utility and scalability across multiple machine learning workflows, and proposes a scalable, effective approach using KNN surrogates.

Contribution

It provides a novel theoretical comparison of importance quantification methods and introduces a scalable Shapley value approximation using KNN surrogates that maintains utility.

Findings

01

KNN-based Shapley approximation achieves comparable utility to existing methods.

02

The proposed method offers significant scalability improvements, often by orders of magnitude.

03

Theoretical analysis justifies the advantage of the KNN surrogate over leave-one-out error.
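
This summary does not spell out how the KNN surrogate is computed. As a rough sketch only, the code below assumes the closed-form recursion for an unweighted K-nearest-neighbor classifier known from the related KNN-Shapley literature; that recursion, the function name `knn_shapley_single_test`, and the toy data are assumptions for illustration and are not taken from this page. For one test point, training points are sorted by distance and values are filled in from the farthest point to the nearest.

```python
def knn_shapley_single_test(train_x, train_y, test_x, test_y, k):
    """Shapley value of each training point for ONE test point under an
    unweighted KNN utility (assumed recursion, see lead-in)."""
    n = len(train_x)
    # Indices of training points, nearest to the test point first
    # (squared Euclidean distance).
    order = sorted(
        range(n),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], test_x)),
    )
    # 1.0 where the training label agrees with the test label, else 0.0.
    match = [1.0 if train_y[i] == test_y else 0.0 for i in order]

    values = [0.0] * n
    # Base case: the farthest point.
    values[order[-1]] = match[-1] / n
    # Walk from the second-farthest point toward the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-indexed rank of the point at position `pos`
        values[order[pos]] = (
            values[order[pos + 1]]
            + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )
    return values


# Hypothetical toy data: the nearest training point has the wrong label.
demo_values = knn_shapley_single_test(
    train_x=[(0.0,), (1.0,)], train_y=[0, 1], test_x=(0.1,), test_y=1, k=1
)
```

After the sort, the recursion is a single linear pass, which is where the scalability of this kind of surrogate comes from. Scores for a whole test set would be obtained by averaging the per-test-point values.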

Abstract

Quantifying the importance of each training point to a learning task is a fundamental problem in machine learning, and the estimated importance scores have been leveraged to guide a range of data workflows such as data summarization and domain adaptation. One simple idea is to use the leave-one-out error of each training point to indicate its importance. Recent work has also proposed using the Shapley value, as it defines a unique value distribution scheme that satisfies a set of appealing properties. However, calculating Shapley values is often expensive, which limits their applicability in real-world applications at scale. Multiple heuristics to improve the scalability of calculating Shapley values have been proposed recently, with the potential risk of compromising their utility in real-world applications. How well do existing data quantification methods perform on existing…
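
To make the two scoring ideas in the abstract concrete, here is a minimal sketch that computes leave-one-out scores and exact Shapley values by enumerating all orderings. The utility function `toy_utility` and the three-point dataset are hypothetical and chosen only for illustration; they are not from the paper. Exact enumeration visits every permutation, so its cost grows factorially with the number of points, which illustrates the expense the abstract refers to.

```python
from itertools import permutations
from math import factorial


def leave_one_out(points, utility):
    """Score each point by the utility lost when it alone is removed."""
    full = frozenset(points)
    return {p: utility(full) - utility(full - {p}) for p in points}


def exact_shapley(points, utility):
    """Average each point's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in points}
    for ordering in permutations(points):
        seen = frozenset()
        for p in ordering:
            totals[p] += utility(seen | {p}) - utility(seen)
            seen = seen | {p}
    n_orderings = factorial(len(points))
    return {p: total / n_orderings for p, total in totals.items()}


def toy_utility(subset):
    # Hypothetical utility: "a" and "b" are redundant copies, either one
    # alone yields full utility 1.0; "c" contributes nothing.
    return 1.0 if ("a" in subset or "b" in subset) else 0.0


POINTS = ["a", "b", "c"]
loo_scores = leave_one_out(POINTS, toy_utility)
shapley_scores = exact_shapley(POINTS, toy_utility)
```

In this toy setup, leave-one-out assigns zero to both redundant points because removing either one leaves the other in place, while the Shapley value splits the credit between them. This is an illustration of how the two schemes can differ, not a result reported by the paper.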

Code & Models

Repositories

easeml/datascope
pytorch

Taxonomy

Topics: Machine Learning and Data Classification · Data Stream Mining Techniques · Imbalanced Data Classification Techniques