Realistic Evaluation of Deep Partial-Label Learning Algorithms
Wei Wang, Dong-Dong Wu, Jindong Wang, Gang Niu, Min-Ling Zhang, Masashi Sugiyama

TL;DR
This paper introduces PLENCH, a comprehensive benchmark for deep partial-label learning, addressing issues in model selection, inconsistent experimental settings, and the lack of real-world datasets, to enable fairer and more practical evaluations.
Contribution
It presents the first systematic benchmark for deep PLL, proposes novel model selection criteria with theoretical guarantees, and introduces a real-world partial-label image dataset.
Findings
Early algorithms can outperform recent complex models.
Inconsistent experimental settings hinder fair evaluation.
The new dataset enables realistic testing of PLL algorithms.
Abstract
Partial-label learning (PLL) is a weakly supervised learning problem in which each example is associated with multiple candidate labels and only one is the true label. In recent years, many deep PLL algorithms have been developed to improve model performance. However, we find that some early developed algorithms are often underestimated and can outperform many later algorithms with complicated designs. In this paper, we delve into the empirical perspective of PLL and identify several critical but previously overlooked issues. First, model selection for PLL is non-trivial, but has never been systematically studied. Second, the experimental settings are highly inconsistent, making it difficult to evaluate the effectiveness of the algorithms. Third, there is a lack of real-world image datasets that can be compatible with modern network architectures. Based on these findings, we propose…
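To make the setting concrete, here is a minimal sketch of partial-label data and of why model selection is awkward when validation examples carry only candidate sets. The synthetic generation scheme (`flip_prob`), the helper names, and the covering-rate-style score are assumptions made for illustration; the truncated abstract does not spell out the paper's own criteria, so this should not be read as the authors' method.

```python
import random


def make_candidate_set(true_label, num_classes, flip_prob, rng):
    """Build a candidate set that always contains the true label.

    Each other label is added independently with probability flip_prob
    (a simple synthetic scheme, assumed here for illustration only).
    """
    candidates = {true_label}
    for label in range(num_classes):
        if label != true_label and rng.random() < flip_prob:
            candidates.add(label)
    return candidates


def covering_rate(predictions, candidate_sets):
    """Fraction of examples whose predicted label lies in its candidate set.

    This needs no ground-truth labels, so it can be computed on a
    partial-label validation set (an illustrative proxy, not accuracy).
    """
    hits = sum(1 for p, c in zip(predictions, candidate_sets) if p in c)
    return hits / len(predictions)


def accuracy(predictions, true_labels):
    """Ordinary accuracy; requires true labels, which PLL validation lacks."""
    hits = sum(1 for p, y in zip(predictions, true_labels) if p == y)
    return hits / len(predictions)


rng = random.Random(0)
num_classes = 10
true_labels = [rng.randrange(num_classes) for _ in range(1000)]
candidate_sets = [
    make_candidate_set(y, num_classes, 0.3, rng) for y in true_labels
]

# Two hypothetical models: one always correct, one guessing uniformly.
perfect_preds = list(true_labels)
random_preds = [rng.randrange(num_classes) for _ in true_labels]

for name, preds in [("perfect", perfect_preds), ("random", random_preds)]:
    print(name, covering_rate(preds, candidate_sets), accuracy(preds, true_labels))
```

Because the true label always sits inside the candidate set, the proxy score can never fall below the true accuracy. It can sit well above it, however, which is one way to see why selecting models from partial labels alone is non-trivial.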
Peer Reviews
Decision: ICLR 2025 Spotlight
- The paper is well-written and accessible.
- Model selection in PLL is a relevant, underexplored topic.
- The proposed dataset in a realistic PLL setting adds significant value to the PLL literature.
- Despite the model selection criteria's simplicity, the authors establish a solid theoretical link to expected accuracy and demonstrate its benefits.
- The experiments are extensive, covering many PLL algorithms from top venues.
While I think that this paper is really good, it would be nice if the authors can clarify the following weaknesses:
- **Model selection criteria:** While the criteria are intuitive and theoretically analyzed, presenting Oracle Accuracy as a contribution is problematic. As also noted by the authors, OA is standard in the literature and, thus, it should be presented as a baseline, not as a novel contribution. IMO the contribution statement in the introduction needs to be slightly adapted.
- **Use
- The authors make a significant contribution by being the first to systematically investigate model selection problems in partial-label learning (PLL), addressing a gap in the existing literature. The paper presents a comprehensive PLL benchmark that includes 27 algorithms and 11 real-world datasets.
- A notable strength of the paper is the introduction of PLCIFAR10, a new benchmark dataset for PLL featuring human-annotated partial labels. This dataset provides an effective and realistic testbe
N/A
The paper points out a critical issue in PLL research: many recent studies try to achieve state-of-the-art results by using an unfair setting when benchmarking against prior methods. This can mislead the research direction, since those prior methods can even outperform many recent PLL approaches. In the standard setting of PLL, the ground-truth labels are not available in validation sets. Hence, it is difficult to tune the hyper-parameters of t
**Confusing terminologies**: *model selection* vs *hyper-parameter tuning*
In the paper, the authors argue that the mismatch of the validation setting results in bad model selection. As far as I understand, what is meant is hyper-parameter tuning, not model selection. Hyper-parameter tuning is indeed a subset of model selection, but not the other way around. Model selection is a broader term in machine learning that also covers things like variable selection and actual model choice (funct
Taxonomy
Topics: Text and Document Classification Technologies · Rough Sets and Fuzzy Logic · Web Applications and Data Management
