GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning
Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Rishabh Iyer

TL;DR
GLISTER is a novel framework for data subset selection that enhances the efficiency and robustness of training deep models by optimizing for validation performance, applicable to various loss functions and learning scenarios.
Contribution
Introduces Glister, a bi-level optimization-based data selection method that improves training efficiency and robustness, with an online algorithm and active learning extension.
Findings
Reduces training time while maintaining accuracy.
Improves robustness under label noise and class imbalance.
Enhances batch active learning performance.
Abstract
Large scale machine learning and deep models are extremely data-hungry. Unfortunately, obtaining large amounts of labeled data is expensive, and training state-of-the-art models (with hyperparameter tuning) requires significant computing resources and time. Secondly, real-world data is noisy and imbalanced. As a result, several recent papers try to make the training process more efficient and robust. However, most existing work either focuses on robustness or efficiency, but not both. In this work, we introduce Glister, a GeneraLIzation based data Subset selecTion for Efficient and Robust learning framework. We formulate Glister as a mixed discrete-continuous bi-level optimization problem to select a subset of the training data, which maximizes the log-likelihood on a held-out validation set. Next, we propose an iterative online algorithm Glister-Online, which performs data selection…
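The bi-level structure described in the abstract can be illustrated with a toy sketch: an inner problem fits a model on a candidate subset, and an outer problem scores that subset by performance on a held-out validation set. The code below is not the paper's Glister-Online algorithm. It is a naive greedy search that refits the inner model for every candidate, and it assumes ridge regression as the model, where lower validation squared error corresponds to higher Gaussian log-likelihood up to constants. All function names and parameters (`fit_ridge`, `val_loss`, `greedy_select`, `lam`, `k`) are hypothetical and chosen for illustration.

```python
import numpy as np


def fit_ridge(X, y, lam=1e-3):
    """Inner problem (assumed model): ridge regression fit on a subset."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def val_loss(w, X_val, y_val):
    """Outer objective: mean squared error on the held-out validation set."""
    residual = X_val @ w - y_val
    return float(np.mean(residual ** 2))


def greedy_select(X, y, X_val, y_val, k):
    """Naive greedy subset selection (illustrative, not Glister-Online).

    At each step, add the training point whose inclusion gives the lowest
    validation loss after refitting the inner model on the enlarged subset.
    """
    selected = []
    remaining = list(range(len(X)))
    for _ in range(k):
        best_i, best_loss = None, np.inf
        for i in remaining:
            idx = selected + [i]
            w = fit_ridge(X[idx], y[idx])
            loss = val_loss(w, X_val, y_val)
            if loss < best_loss:
                best_i, best_loss = i, loss
        selected.append(best_i)
        remaining.remove(best_i)
    return selected


# Synthetic data: 40 training points, the first 10 with corrupted labels.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X_train = rng.normal(size=(40, 3))
y_train = X_train @ w_true + 0.1 * rng.normal(size=40)
y_train[:10] += rng.normal(scale=5.0, size=10)  # simulated label noise
X_val = rng.normal(size=(20, 3))
y_val = X_val @ w_true + 0.1 * rng.normal(size=20)

subset = greedy_select(X_train, y_train, X_val, y_val, k=8)
print("selected indices:", subset)
```

Refitting the model for every candidate costs one inner solve per candidate per step, which is only practical for small problems like this one. The abstract describes an online algorithm precisely because this kind of exhaustive refitting does not scale to deep models.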
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Imbalanced Data Classification Techniques
