CheckSel: Efficient and Accurate Data-valuation Through Online Checkpoint Selection
Soumi Das, Manasvi Sagarkar, Suparna Bhattacharya, Sourangshu Bhattacharya

TL;DR
CheckSel introduces an efficient two-phase method for data valuation that combines checkpoint selection with online sparse approximation, significantly improving accuracy and efficiency in data subset selection for AI models.
Contribution
The paper presents CheckSel, a novel online checkpoint selection algorithm inspired by Orthogonal Matching Pursuit, and explores its application in domain adaptation for data valuation.
Findings
Outperforms recent baseline methods by up to 30% in test accuracy
Maintains a computational burden similar to that of existing methods
Effective in both standalone and domain adaptation settings
Abstract
Data valuation and subset selection have emerged as valuable tools for application-specific selection of important training data. However, the efficiency-accuracy tradeoffs of state-of-the-art methods hinder their widespread application to many AI workflows. In this paper, we propose a novel two-phase solution to this problem. Phase 1 selects representative checkpoints from an SGD-like training algorithm, which are used in Phase 2 to estimate the approximate training data values, e.g., the decrease in validation loss due to each training point. A key contribution of this paper is CheckSel, an Orthogonal Matching Pursuit-inspired online sparse approximation algorithm for checkpoint selection in the online setting, where the features are revealed one at a time. Another key contribution is the study of data valuation in the domain adaptation setting, where a data value estimator obtained using…
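For readers unfamiliar with Orthogonal Matching Pursuit, the following is a minimal sketch of the standard batch algorithm that the abstract names as CheckSel's inspiration. It is not CheckSel itself, which works online with features revealed one at a time. The function name `omp_select`, the toy matrix, and the reading of each column as one checkpoint's contribution are assumptions made for illustration only.

```python
import numpy as np

def omp_select(features, target, k):
    """Standard (batch) Orthogonal Matching Pursuit.

    Greedily picks k columns of `features` whose span approximates `target`.
    Hypothetical helper for illustration; not the paper's online algorithm.
    """
    selected = []
    weights = np.zeros(0)
    residual = target.astype(float).copy()
    for _ in range(k):
        # Score every column by its absolute correlation with the residual.
        scores = np.abs(features.T @ residual)
        if selected:
            scores[selected] = -np.inf  # never pick the same column twice
        selected.append(int(np.argmax(scores)))
        # Refit the weights of all selected columns jointly (the "orthogonal" step).
        sub = features[:, selected]
        weights, *_ = np.linalg.lstsq(sub, target, rcond=None)
        residual = target - sub @ weights
    return selected, weights, residual

# Toy data (assumed): four orthonormal columns, target built from columns 1 and 3.
features = np.eye(4)
target = np.array([0.0, 3.0, 0.0, -2.0])
selected, weights, residual = omp_select(features, target, k=2)
print(selected, weights, np.linalg.norm(residual))
```

On this toy input the greedy loop picks the two columns that actually make up the target and drives the residual to zero. With correlated columns, recovery is not guaranteed in general.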
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Data Stream Mining Techniques
