Towards Accelerated Model Training via Bayesian Data Selection
Zhijie Deng, Peng Cui, Jun Zhu

TL;DR
This paper introduces a Bayesian data selection method that improves training efficiency by effectively handling noisy, biased, and duplicated data, reducing training time without sacrificing accuracy.
Contribution
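It proposes a lightweight Bayesian treatment combined with off-the-shelf zero-shot predictors built on large-scale pre-trained models. This addresses two limitations of the previous generalization-loss-based selection principle: its reliance on less principled approximations and its need for additional holdout data.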
Findings
Achieves similar accuracy with fewer training iterations on WebVision.
Outperforms existing data selection methods in training efficiency.
Effective in noisy and imbalanced data scenarios.
Abstract
Mislabeled, duplicated, or biased data in real-world scenarios can lead to prolonged training and even hinder model convergence. Traditional solutions prioritizing easy or hard samples lack the flexibility to handle such a variety simultaneously. Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. However, its practical adoption relies on less principled approximations and additional holdout data. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models. The resulting algorithm is efficient and easy to implement. We perform extensive empirical studies on challenging benchmarks with considerable data noise and imbalance in the online batch selection scenario, and observe superior training…
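The online batch selection setting mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the scoring rule (current-model loss minus the loss of a fixed reference predictor), the function names `cross_entropy` and `select_batch`, and the toy probabilities are all assumptions chosen for illustration, with the reference predictor standing in for a zero-shot predictor.

```python
import numpy as np


def cross_entropy(probs, labels):
    """Per-example negative log-likelihood for integer class labels."""
    eps = 1e-12  # avoids log(0)
    return -np.log(probs[np.arange(len(labels)), labels] + eps)


def select_batch(model_probs, reference_probs, labels, k):
    """Pick the k candidates with the largest score.

    Score (an assumption for this sketch, not the paper's objective):
    loss under the current model minus loss under a reference predictor.
    """
    scores = cross_entropy(model_probs, labels) - cross_entropy(reference_probs, labels)
    top_k = np.argsort(-scores)[:k]  # indices sorted by descending score
    return top_k, scores


# Toy candidate batch: 4 examples, 2 classes (hypothetical numbers).
labels = np.array([0, 0, 1, 1])
model_probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.5, 0.5]])
reference_probs = np.array([[0.95, 0.05], [0.9, 0.1], [0.9, 0.1], [0.2, 0.8]])

top_k, scores = select_batch(model_probs, reference_probs, labels, k=2)
print(top_k)   # indices chosen for the gradient step
print(scores)  # per-candidate scores
```

In this toy batch, the third candidate has a label that the reference predictor strongly disagrees with, so its score is negative and it is not selected. The first candidate is already fit well by the model and also scores low. A rule of this shape can therefore skip both likely mislabeled and already-learned examples, which is the kind of flexibility the abstract argues for.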
Taxonomy
Topics: Machine Learning and Data Classification · Anomaly Detection Techniques and Applications · Advanced Neural Network Applications
