Data Debugging is NP-hard for Classifiers Trained with SGD

Zizheng Guo; Pengyu Chen; Yanzhang Fu; Dongjing Miao

arXiv:2408.01365·cs.CC·August 5, 2024

Open Access

TL;DR

This paper proves that data debugging for classifiers trained with SGD is generally NP-hard, but can be efficiently solved for linear loss functions, providing a theoretical foundation for future algorithm development.

Contribution

The paper establishes the NP-hardness of data debugging for SGD-trained classifiers and identifies cases where it can be solved efficiently, advancing theoretical understanding.

Findings

01 NP-complete for general loss functions and high dimensions

02 Polynomial-time solution for linear loss functions

03 Provides complexity insights for hinge-like loss functions

Abstract

Data debugging is the task of finding a subset of the training data such that the model obtained by retraining on that subset has better accuracy. Many heuristic approaches have been proposed; however, none of them is guaranteed to solve this problem effectively. This leaves open the question of whether there exists an efficient algorithm to find a subset such that the model obtained by retraining on it has better accuracy. To answer this open question and provide a theoretical basis for further study on developing better algorithms for data debugging, we investigate the computational complexity of the problem named Debuggable. Given a machine learning model $M$ obtained by training on dataset $D$ and a test instance $(x_{test}, y_{test})$ where $M(x_{test}) \neq y_{test}$, Debuggable is to determine whether there exists a subset $D'$ of…
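To make the decision problem in the abstract concrete, here is a minimal brute-force sketch. It is not the paper's construction or algorithm. Since the abstract is truncated here, the sketch assumes that Debuggable asks for a subset of $D$ whose retrained model predicts $y_{test}$ on $x_{test}$. The linear classifier, the hinge loss, the zero initialization, the fixed training order, the learning rate, the epoch count, and all function names (`sgd_train`, `predict`, `debuggable_bruteforce`) are illustrative assumptions.

```python
from itertools import combinations


def sgd_train(data, dim, lr=0.1, epochs=20):
    """Train a linear classifier with SGD on the hinge loss.

    Assumptions (not from the paper): zero initialization, fixed
    sample order, constant learning rate, labels in {-1, +1}.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1:  # hinge-loss subgradient step
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 if the score is strictly positive, otherwise -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


def debuggable_bruteforce(data, x_test, y_test, dim):
    """Try every subset, largest first; return one that fixes the test
    prediction after retraining, or None if no subset does.

    This enumerates up to 2^|D| subsets, so it is only usable on toy inputs.
    """
    for k in range(len(data), -1, -1):
        for subset in combinations(data, k):
            w = sgd_train(list(subset), dim)
            if predict(w, x_test) == y_test:
                return list(subset)
    return None


# Toy one-dimensional dataset (hypothetical): two points labeled -1
# and two labeled +1 at the same location.
D = [((1.0,), -1), ((1.0,), -1), ((1.0,), 1), ((1.0,), 1)]
x_test, y_test = (1.0,), 1

full_prediction = predict(sgd_train(D, dim=1), x_test)
fix = debuggable_bruteforce(D, x_test, y_test, dim=1)
print("prediction on full data:", full_prediction)
print("subset that fixes the prediction:", fix)
```

The exhaustive search takes exponential time in the size of $D$. That is in line with the hardness result stated above, though it illustrates the problem and proves nothing about it.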

Taxonomy

Topics: Machine Learning and Data Classification · Artificial Intelligence in Healthcare

Methods: Stochastic Gradient Descent