Towards a statistical theory of data selection under weak supervision
Germain Kolossov, Andrea Montanari, Pulkit Tandon

TL;DR
This paper develops a statistical framework for selecting informative data subsets under weak supervision, demonstrating that strategic data selection can outperform using the full dataset and highlighting limitations of existing methods.
Contribution
It introduces a theoretical approach to data selection under weak supervision, combining mathematical analysis with experiments to improve understanding of optimal sampling strategies.
Findings
Data selection can outperform full dataset training in some cases.
Popular data selection methods may be suboptimal.
Theoretical insights guide better data sampling strategies.
Abstract
Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the computational complexity of learning. We assume to be given $N$ unlabeled samples $\{x_i\}_{i\le N}$, and to be given access to a "surrogate model" that can predict labels $y_i$ better than random guessing. Our goal is to select a subset of the samples, to be denoted by $\{x_i\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this set and we use them to train a model via regularized empirical risk minimization. By using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: (i) Data selection can be very effective, in…
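The pipeline described in the abstract (score unlabeled samples with a surrogate model, select a subset, acquire labels only for that subset, then fit by regularized empirical risk minimization) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic logistic data, the noisy-parameter surrogate, the uncertainty score p(1-p) with exponent `alpha`, and the helper names `select_subset` and `fit_ridge_logistic`.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def select_subset(X, surrogate_theta, n, rng, alpha=1.0):
    """Draw n distinct row indices, favoring samples the surrogate is unsure about.

    The score (p * (1 - p)) ** alpha is an illustrative choice, not the paper's scheme.
    """
    p = sigmoid(X @ surrogate_theta)
    scores = (p * (1.0 - p)) ** alpha + 1e-12  # small offset keeps every probability positive
    probs = scores / scores.sum()
    return rng.choice(X.shape[0], size=n, replace=False, p=probs)


def fit_ridge_logistic(X, y, lam=1e-2, lr=0.5, steps=500):
    """L2-regularized logistic regression fitted by plain gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / X.shape[0] + lam * theta
        theta -= lr * grad
    return theta


rng = np.random.default_rng(0)
N, d, n = 2000, 5, 200

# Synthetic setup (assumption): logistic ground truth and a noisy surrogate.
theta_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
surrogate_theta = theta_true + 0.5 * rng.normal(size=d)

# Select the subset G using the surrogate only; no labels are used at this stage.
G = select_subset(X, surrogate_theta, n, rng)

# "Acquire" labels for the selected samples only.
y_G = (rng.random(n) < sigmoid(X[G] @ theta_true)).astype(float)

# Train on the selected subset. No importance reweighting is applied, so this is a biased scheme.
theta_hat = fit_ridge_logistic(X[G], y_G)
```

Setting `alpha=0` in this sketch reduces the selection to uniform subsampling, which gives a simple baseline to compare against.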
Peer Reviews
Decision·ICLR 2024 oral
A thorough study is performed of the general properties that are beneficial for all subsampling schemes. Importantly, model generalization is well studied, in addition to the task of simply solving an optimization problem. The comparison of biased and unbiased sampling was very insightful: many works consider only unbiased sampling, which turns out to be suboptimal in the presented setting. Additional note after the Reviewer-Authors discussion (review score raised): The p
My main concern is the applicability in a general setting and the assumptions in the paper: - There is a concern that (to my understanding) only the behavior exactly at the optimum was considered (or at least in a small neighbourhood of the optimum); see, for example, equation B.3 (definition of the error based only on optimal values of the parameters) and assumption B.1.A1 (lack of multiple optimal values). In most non-trivial non-linear models an iterative optimization pr
The paper provides various results that I find interesting: (i) While a standard method for data selection is unbiased sub-sampling, Theorem 1 shows that the error coefficient of unbiased schemes can be arbitrarily larger than that of biased ones. Hence, in many cases, unbiased subsampling is sub-optimal (e.g., Figure 1). (ii) Figure 1 and Theorem 2 provide a setting where ERM using a selected subset of the data can lead to a better model than ERM on the full dataset. (iii) The surrogate mod
I think that some parts of the paper are hard to follow and could be written more clearly (for instance, Sections 4-5). I understand that space constraints make the presentation more challenging. I do not find any other significant weaknesses.
- The problem seems well-motivated.
- The presentation is quite technical. Readers who are not experts in this area may find this paper hard to follow.
Taxonomy
Topics: Statistical Methods and Inference · Machine Learning and Algorithms · Advanced Statistical Methods and Models
