A critical look at the current train/test split in machine learning
Jimin Tan, Jianan Yang, Sai Wu, Gang Chen, Jake Zhao (Junbo)

TL;DR
This paper critically examines the limitations of traditional train/test splits in machine learning, especially in industrial contexts, and proposes an adaptive active learning framework to address data annotation challenges.
Contribution
It introduces a novel adaptive active learning architecture (AAL) that dynamically adjusts data acquisition, improving model training under realistic, resource-constrained scenarios.
Findings
AAL improves data efficiency in drug discovery applications.
AAL demonstrates generalizability on benchmark datasets like CIFAR-10.
Traditional splits are limited in real-world, resource-constrained settings.
Abstract
The randomized or cross-validated split of training and testing sets has been adopted as the gold standard of machine learning for decades. The establishment of these split protocols is based on two assumptions: (i) the dataset is fixed and eternally static, so that different machine learning algorithms or models can be evaluated on it; (ii) a complete set of annotated data is available to researchers or industrial practitioners. In this article, however, we take a closer, critical look at the split protocol itself and point out its weaknesses and limitations, especially for industrial applications. In many real-world problems, we must acknowledge that there are numerous situations where assumption (ii) does not hold. For instance, in interdisciplinary applications like drug discovery, annotating data often requires real lab experiments, which pose huge costs in both time and…
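The contrast the abstract draws, between labeling data up front and acquiring labels under a budget, can be illustrated with a small sketch. The code below is not the paper's AAL architecture; it is a generic pool-based active learning loop with uncertainty sampling on a toy 1-D problem, written only to show the idea of choosing which points to annotate. The `oracle` function, the hidden threshold of 0.5, the pool size, and the budget of 10 labels are all hypothetical choices made for illustration.

```python
import random

def oracle(x, true_threshold=0.5):
    """Hypothetical stand-in for a costly annotation step (e.g. a lab experiment)."""
    return int(x >= true_threshold)

def fit_threshold(labeled):
    """Toy 1-D classifier: midpoint between the largest labeled negative
    and the smallest labeled positive (defaults to 0.0 / 1.0 if a class is missing)."""
    neg = [x for x, y in labeled if y == 0]
    pos = [x for x, y in labeled if y == 1]
    lo = max(neg) if neg else 0.0
    hi = min(pos) if pos else 1.0
    return (lo + hi) / 2

def accuracy(threshold, points):
    """Fraction of points whose predicted label matches the oracle."""
    return sum(int(x >= threshold) == oracle(x) for x in points) / len(points)

def run(pool, budget, strategy, rng):
    """Spend `budget` annotations, picking each query by `strategy`."""
    pool = list(pool)
    labeled = []
    for _ in range(budget):
        t = fit_threshold(labeled)
        if strategy == "uncertainty":
            # Query the unlabeled point closest to the current decision boundary.
            x = min(pool, key=lambda p: abs(p - t))
        else:
            # Baseline: label a uniformly random point, as a random split would.
            x = rng.choice(pool)
        pool.remove(x)
        labeled.append((x, oracle(x)))
    return fit_threshold(labeled)

rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]         # unlabeled pool
eval_points = [rng.random() for _ in range(1000)]  # held-out evaluation points

t_active = run(pool, budget=10, strategy="uncertainty", rng=rng)
t_random = run(pool, budget=10, strategy="random", rng=rng)

print("uncertainty sampling:", t_active, accuracy(t_active, eval_points))
print("random sampling:     ", t_random, accuracy(t_random, eval_points))
```

On this toy problem, uncertainty sampling behaves like a binary search over the pool, so the estimated threshold lands close to the hidden one within a handful of labels. Random sampling under the same budget has no such guarantee, which is the kind of gap that matters when each label costs a lab experiment.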
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Machine Learning in Materials Science
