Data Debugging is NP-hard for Classifiers Trained with SGD
Zizheng Guo, Pengyu Chen, Yanzhang Fu, Dongjing Miao

TL;DR
This paper proves that data debugging for classifiers trained with SGD is generally NP-hard, but can be efficiently solved for linear loss functions, providing a theoretical foundation for future algorithm development.
Contribution
The paper establishes the NP-hardness of data debugging for SGD-trained classifiers and identifies cases where it can be solved efficiently, advancing theoretical understanding.
Findings
NP-complete for general loss functions and high dimensions
Polynomial-time solution for linear loss functions
Provides complexity insights for hinge-like loss functions
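One way to see why a linear loss can make the problem tractable is the hedged sketch below. It is not taken from the paper. It assumes the loss is l(w; x, y) = -y * <w, x>, a single SGD pass, and a fixed learning rate. Under those assumptions each training sample adds lr * y * x to the weights regardless of the current weights, so every sample's effect on the test margin is additive and can be judged independently. The function name `debug_linear_loss` and its parameters are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def debug_linear_loss(train, x_test, y_test, w0, lr=0.1):
    """Greedy subset selection under an ASSUMED linear loss -y * <w, x>.

    With this loss the SGD step for (x, y) is w <- w + lr * y * x, which does
    not depend on w. The final test margin is therefore
        y_test * <w0, x_test> + sum over kept samples of lr * y_test * y * <x, x_test>,
    and it is maximized by keeping exactly the samples with a positive term.
    Returns the kept samples if the best margin is positive, else None.
    """
    margin = y_test * dot(w0, x_test)
    kept = []
    for x, y in train:
        gain = lr * y_test * y * dot(x, x_test)
        if gain > 0:  # this sample pushes the test instance toward its label
            kept.append((x, y))
            margin += gain
    return kept if margin > 0 else None
```

The loop visits each sample once, so the running time is linear in the size of the training set under these assumptions. The paper's own polynomial-time result may rely on a different construction.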
Abstract
Data debugging is to find a subset of the training data such that the model obtained by retraining on the subset has better accuracy. Many heuristic approaches have been proposed; however, none of them is guaranteed to solve this problem effectively. This leaves open the question of whether there exists an efficient algorithm to find a subset such that the model obtained by retraining on it has better accuracy. To answer this open question and provide a theoretical basis for further study on developing better algorithms for data debugging, we investigate the computational complexity of the problem named Debuggable. Given a machine learning model $\mathcal{M}$ obtained by training on dataset $D$ and a test instance $(\mathbf{x}_\text{test}, y_\text{test})$ where $\mathcal{M}(\mathbf{x}_\text{test}) \neq y_\text{test}$, Debuggable is to determine whether there exists a subset of…
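The decision problem can be illustrated with a brute-force sketch. The abstract is truncated here, so the code assumes, based on its opening sentences, that the goal is a subset whose retrained model classifies the test instance correctly. The hinge loss, the single SGD pass in the given order, the zero initial weights, and all function names are assumptions made for illustration, not the paper's setup.

```python
from itertools import combinations


def sgd_train(data, lr=0.1, dim=2):
    """One SGD pass over `data` in the given order, hinge loss (assumed)."""
    w = [0.0] * dim
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1:  # hinge subgradient is -y * x only when margin < 1
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Linear classifier: +1 if <w, x> > 0, otherwise -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


def debuggable_bruteforce(train, x_test, y_test, lr=0.1):
    """Try every subset of `train`, largest first.

    Returns the first subset whose retrained model predicts y_test on x_test,
    or None if no subset does. Enumerates up to 2**len(train) subsets.
    """
    dim = len(x_test)
    n = len(train)
    for k in range(n, -1, -1):
        for idx in combinations(range(n), k):
            subset = [train[i] for i in idx]
            if predict(sgd_train(subset, lr, dim), x_test) == y_test:
                return subset
    return None
```

The search retrains once per subset, so its cost grows exponentially with the size of the training set. That is why it matters whether an efficient algorithm exists.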
Taxonomy
Topics: Machine Learning and Data Classification · Artificial Intelligence in Healthcare
Methods: Stochastic Gradient Descent
