Provable Training Set Debugging for Linear Regression
Xiaomin Zhang, Xiaojin Zhu, Po-Ling Loh

TL;DR
This paper develops a provable method for identifying and correcting contaminated data points in linear regression models, combining theoretical guarantees with practical algorithms and game-theoretic insights.
Contribution
It introduces a Lasso-based debugging algorithm with theoretical guarantees, analyzes a game-theoretic model of data contamination, and demonstrates practical effectiveness through case studies.
Findings
The Lasso-based method reliably identifies buggy points under certain conditions.
There is a theoretical condition under which the bug generator can fool the debugger, but this condition is unlikely to hold in practice.
Empirical results show successful debugging with natural data augmentation strategies.
Abstract
We investigate problems in penalized $M$-estimation, inspired by applications in machine learning debugging. Data are collected from two pools, one containing data with possibly contaminated labels, and the other which is known to contain only cleanly labeled points. We first formulate a general statistical algorithm for identifying buggy points and provide rigorous theoretical guarantees under the assumption that the data follow a linear model. We then present two case studies to illustrate the results of our general theory and the dependence of our estimator on clean versus buggy points. We further propose an algorithm for tuning parameter selection of our Lasso-based algorithm and provide corresponding theoretical guarantees. Finally, we consider a two-person "game" played between a bug generator and a debugger, where the debugger can augment the contaminated data set with cleanly…
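To make the two-pool setup concrete, here is a minimal sketch of one way a Lasso-based debugger could be written. It is not taken from the paper: it assumes a mean-shift model in which each possibly contaminated label carries its own shift term gamma_i with an L1 penalty, while the clean pool has no shift, and it solves the resulting convex problem by alternating a least-squares step with a soft-thresholding step. The function names `soft_threshold` and `lasso_debug`, the weight `eta` on the clean pool, and the exact scaling of the objective are illustrative assumptions.

```python
import numpy as np


def soft_threshold(z, t):
    """Elementwise soft-thresholding: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_debug(X, y, X_clean, y_clean, lam, eta=1.0, n_iter=200):
    """Hypothetical sketch of a Lasso-based debugger (assumed formulation).

    Minimizes, over beta and gamma,
        (1/2n) * ||y - X beta - gamma||^2
      + (eta/2m) * ||y_clean - X_clean beta||^2
      + lam * ||gamma||_1
    by block coordinate descent. Nonzero entries of gamma flag the
    points in the first pool that the method considers buggy.
    """
    n, m = X.shape[0], X_clean.shape[0]
    gamma = np.zeros(n)
    # Rescale the clean pool so both squared-error terms share one
    # ordinary least-squares problem.
    w = np.sqrt(eta * n / m)
    X_stack = np.vstack([X, w * X_clean])
    for _ in range(n_iter):
        # beta step: least squares on shift-corrected labels plus clean pool.
        y_stack = np.concatenate([y - gamma, w * y_clean])
        beta, *_ = np.linalg.lstsq(X_stack, y_stack, rcond=None)
        # gamma step: closed-form soft-thresholding of the residuals.
        gamma = soft_threshold(y - X @ beta, n * lam)
    return beta, gamma


if __name__ == "__main__":
    # Synthetic example: shift the labels of the first five points.
    rng = np.random.default_rng(0)
    beta_true = np.array([1.0, -2.0, 0.5])
    X = rng.normal(size=(100, 3))
    X_clean = rng.normal(size=(50, 3))
    y = X @ beta_true + 0.1 * rng.normal(size=100)
    y[:5] += 10.0
    y_clean = X_clean @ beta_true + 0.1 * rng.normal(size=50)
    beta_hat, gamma_hat = lasso_debug(X, y, X_clean, y_clean, lam=0.01)
    print("flagged indices:", np.flatnonzero(gamma_hat))
```

In this sketch the tuning parameter `lam` sets the residual size (`n * lam`) above which a point is flagged, which is why its selection matters: too small and clean points are flagged by noise alone, too large and genuine bugs go undetected.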
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Statistical Methods and Inference
