A critical look at the current train/test split in machine learning

Jimin Tan; Jianan Yang; Sai Wu; Gang Chen; Jake Zhao (Junbo)

arXiv:2106.04525 · cs.LG · June 9, 2021 · 43 cites

Open Access

TL;DR

This paper critically examines the limitations of traditional train/test splits in machine learning, especially in industrial contexts, and proposes an adaptive active learning framework to address data annotation challenges.

Contribution

It introduces a novel adaptive active learning architecture (AAL) that dynamically adjusts data acquisition, improving model training under realistic, resource-constrained scenarios.

Findings

01. AAL improves data efficiency in drug discovery applications.

02. AAL demonstrates generalizability on benchmark datasets like CIFAR-10.

03. Traditional splits are limited in real-world, resource-constrained settings.

Abstract

The randomized or cross-validated split of training and testing sets has been adopted as the gold standard of machine learning for decades. The establishment of these split protocols is based on two assumptions: (i) the dataset is fixed to be eternally static so that different machine learning algorithms or models can be evaluated; (ii) a complete set of annotated data is available to researchers or industrial practitioners. However, in this article, we intend to take a closer and critical look at the split protocol itself and point out its weaknesses and limitations, especially for industrial applications. In many real-world problems, we must acknowledge that there are numerous situations where assumption (ii) does not hold. For instance, interdisciplinary applications like drug discovery often require real lab experiments to annotate data, which poses huge costs in both time and…
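To make the setting concrete, here is a minimal sketch of a generic pool-based active learning loop in which labels are acquired a few at a time under a fixed annotation budget, instead of assuming a fully annotated dataset that is split once. It is not the paper's AAL architecture, whose details are not given on this page. The synthetic data, the nearest-centroid model, the margin-based query rule, and every name in the code (`make_pool`, `active_learning`, `budget`, `batch`) are illustrative assumptions.

```python
import numpy as np


def make_pool(n=400, seed=0):
    """Synthetic two-class pool in 2D (a stand-in for unlabeled candidates)."""
    rng = np.random.default_rng(seed)
    half = n // 2
    X = np.vstack([
        rng.normal(loc=-2.0, scale=1.0, size=(half, 2)),
        rng.normal(loc=2.0, scale=1.0, size=(half, 2)),
    ])
    y = np.concatenate([np.zeros(half, dtype=int), np.ones(half, dtype=int)])
    return X, y


def centroid_distances(X, centroids):
    """Euclidean distance from every point to each class centroid."""
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)


def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])


def predict(X, centroids):
    """Assign each point to the class of its closest centroid."""
    return centroid_distances(X, centroids).argmin(axis=1)


def active_learning(X, oracle, budget=24, batch=4, seed=0):
    """Label at most `budget` points, always querying the least certain ones."""
    rng = np.random.default_rng(seed)
    # Assumption: a tiny seed set with two labels per class already exists.
    labeled = [int(i) for c in (0, 1)
               for i in rng.choice(np.flatnonzero(oracle == c), size=2, replace=False)]
    while len(labeled) < budget:
        # Reading oracle[...] stands in for a costly annotation (e.g. a lab run).
        centroids = fit_centroids(X[labeled], oracle[labeled])
        unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
        d = centroid_distances(X[unlabeled], centroids)
        margin = np.abs(d[:, 0] - d[:, 1])  # small margin = uncertain point
        k = min(batch, budget - len(labeled))
        query = unlabeled[np.argsort(margin)[:k]]
        labeled.extend(int(i) for i in query)
    centroids = fit_centroids(X[labeled], oracle[labeled])
    return labeled, centroids


X, y = make_pool()
labeled, centroids = active_learning(X, y)
accuracy = float((predict(X, centroids) == y).mean())
print(f"labels used: {len(labeled)}, pool accuracy: {accuracy:.3f}")
```

The accuracy here is computed over the whole synthetic pool purely for illustration. In the setting the abstract describes, those labels would not exist, which is why a fixed, fully annotated test split is hard to assume.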

Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Machine Learning in Materials Science