Does Deep Active Learning Work in the Wild?
Simiao Ren, Saad Lahrichi, Yang Deng, Willie J. Padilla, Leslie Collins, Jordan Malof

TL;DR
This paper evaluates deep active learning (DAL) methods under realistic conditions and finds that most of them sometimes underperform random sampling when their hyperparameters are not tuned in advance, which remains an open challenge for practical deployment.
Contribution
The study assesses eleven DAL methods on eight benchmark problems while varying one hyperparameter shared by all of them, the pool ratio, and identifies diversity-based methods as the more reliable choice when good hyperparameter settings are unknown.
Findings
Eight of eleven DAL methods sometimes underperform random sampling.
Only three methods consistently outperform random sampling, all using diversity.
Hyperparameter sensitivity significantly affects DAL performance in real-world scenarios.
Abstract
Deep active learning (DAL) methods have shown significant improvements in sample efficiency compared to simple random sampling. While these studies are valuable, they nearly always assume that optimal DAL hyperparameter (HP) settings are known in advance, or optimize the HPs through repeating DAL several times with different HP settings. Here, we argue that in real-world settings, or in the wild, there is significant uncertainty regarding good HPs, and their optimization contradicts the premise of using DAL (i.e., we require labeling efficiency). In this study, we evaluate the performance of eleven modern DAL methods on eight benchmark problems as we vary a key HP shared by all methods: the pool ratio. Despite adjusting only one HP, our results indicate that eight of the eleven DAL methods sometimes underperform relative to simple random sampling and some frequently perform worse. Only…
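To make the role of the pool ratio concrete, here is a minimal sketch of one acquisition step. It assumes the pool ratio is the number of unlabeled candidates scored per point acquired, which this excerpt does not define. The names `select_batch` and `score_fn` are hypothetical placeholders, not the authors' code, and `score_fn` stands in for whatever acquisition score a given DAL method uses.

```python
import random


def select_batch(unlabeled, score_fn, batch_size, pool_ratio, rng):
    """Pick `batch_size` points to label from `unlabeled`.

    Assumption (not stated in the excerpt): the candidate pool holds
    pool_ratio * batch_size points drawn at random from the unlabeled set.
    """
    pool_size = min(len(unlabeled), int(round(pool_ratio * batch_size)))
    # Draw the candidate pool uniformly at random, without replacement.
    pool = rng.sample(unlabeled, pool_size)
    # Rank candidates by the (hypothetical) acquisition score, highest first.
    ranked = sorted(pool, key=score_fn, reverse=True)
    return ranked[:batch_size]


if __name__ == "__main__":
    rng = random.Random(0)
    data = list(range(100))
    # Toy score: larger values count as "more informative".
    print(select_batch(data, lambda x: x, batch_size=5, pool_ratio=20, rng=rng))
    # -> [99, 98, 97, 96, 95]  (the pool covers all 100 points)
    print(select_batch(data, lambda x: x, batch_size=5, pool_ratio=1, rng=rng))
    # With pool_ratio = 1 the pool is exactly the batch, so the score has no say.
```

Under this assumed definition, a pool ratio of 1 reduces the step to random sampling, while a very large ratio lets the acquisition score dominate the choice. That is one way to see why the setting could matter so much for methods that rely on the score alone.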
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. It tests eleven different DAL methods across various problems. 2. It highlights that methods using sample diversity are more reliable. 3. The findings might be useful for people who want to use DAL in practical settings.
1. Setting the pool ratio itself is not meaningful, since we can first use diversity-based measures to determine the subset and then use uncertainty-based measures to get better AL performance. 2. Datasets are too simple. 3. The conclusions of this paper rely on empirical evidence to support their findings. For example, it lacks a formal theoretical framework or mathematical proof to explain why diversity-based methods are inherently more robust. 4. As the author mentioned "The recent study b
1. This paper studies whether deep active learning algorithms are practical in real-world applications, which is an important but overlooked problem. 2. It benchmarks 11 DAL methods across 8 regression tasks, providing a thorough empirical analysis. The authors’ choice of real-world, emerging scientific datasets like aerodynamics and materials science adds practical value, testing DAL methods in diverse, high-impact applications and enhancing the study's relevance. 3. The evaluation shows that
1. The paper evaluates only scientific regression tasks. It would be good to include more complicated vision or NLP tasks to better demonstrate that the conclusion holds in other scenarios. 2. The entire evaluation concerns a single hyperparameter, the pool ratio, which limits the contribution of the paper. The paper would be more solid if more factors were considered and evaluated. 3. An interesting follow-up problem can be that, if many DAL methods fail to achieve good performance when hype
The article focuses on the critical transition between theoretical research and practical application of DAL, a problem of great importance in DAL research. In addition to empirical evidence about which strategies perform best, the authors investigate critical aspects that make these strategies superior.
There is a significant discrepancy between the motivation of the topic and the subsequent investigation. While choosing DAL hyperparameters in the application is critical, the authors investigate this problem based on a single hyperparameter and 8 smaller regression tasks, which is insufficient empirical evidence. In addition, any other hyperparameters of strategies, model architecture, and model training, for instance, were chosen from previous works, which contradicts the notion that good hype
- The paper is relatively clearly written and easy to follow - Active Learning represents an interesting research area, taking into account that nowadays the amount of data being generated is so huge, it is impossible to be human-labeled.
- There is no scientific contribution - The evaluation is limited. The paper analyzes only one hyperparameter. A more comprehensive analysis would have been preferred and expected - The evaluated methods are old (only two of them are 4 and 5 years old, respectively; the others date from before 2020).
1. This thesis examines an interesting and equally important question: “Does Deep Active Learning Work in the Wild?” 2. The authors have done extensive experiments to demonstrate the importance of the pool ratio hyperparameter.
1. The question of whether Deep Active Learning works in real-world settings is indeed important, and this study addresses it primarily by examining the effect of hyperparameters. While this focus on hyperparameters is valuable, it is not fundamentally different from other machine learning tasks, which are also often sensitive to hyperparameter settings. Consequently, the title of the work appears somewhat ambitious, as the actual scope of the study is relatively narrow and could be seen as slig
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Imbalanced Data Classification Techniques
