Realistic Evaluation of Deep Partial-Label Learning Algorithms

Wei Wang; Dong-Dong Wu; Jindong Wang; Gang Niu; Min-Ling Zhang; Masashi Sugiyama

arXiv:2502.10184 · cs.LG · February 17, 2025

TL;DR

This paper introduces PLENCH, a comprehensive benchmark for deep partial-label learning, addressing issues in model selection, inconsistent experimental settings, and the lack of real-world datasets, to enable fairer and more practical evaluations.

Contribution

It presents the first systematic benchmark for deep PLL, proposes novel model selection criteria with theoretical guarantees, and introduces a real-world partial-label image dataset.

Findings

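01. Some early PLL algorithms are underestimated and can outperform many later algorithms with more complicated designs.

02. Inconsistent experimental settings across papers hinder fair evaluation.

03. The new PLCIFAR10 dataset, with human-annotated partial labels, enables realistic testing of PLL algorithms.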

Abstract

Partial-label learning (PLL) is a weakly supervised learning problem in which each example is associated with multiple candidate labels and only one is the true label. In recent years, many deep PLL algorithms have been developed to improve model performance. However, we find that some early developed algorithms are often underestimated and can outperform many later algorithms with complicated designs. In this paper, we delve into the empirical perspective of PLL and identify several critical but previously overlooked issues. First, model selection for PLL is non-trivial, but has never been systematically studied. Second, the experimental settings are highly inconsistent, making it difficult to evaluate the effectiveness of the algorithms. Third, there is a lack of real-world image datasets that can be compatible with modern network architectures. Based on these findings, we propose…
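To make the setting concrete, the sketch below represents partial labels as a binary candidate matrix and computes a validation score that needs no ground-truth labels. It is an illustration only, not the paper's method: the uniform label-flipping generator and the names `make_partial_labels`, `candidate_hit_rate`, and `flip_prob` are assumptions introduced here, and the hit rate is not claimed to be one of the selection criteria the paper proposes.

```python
import numpy as np


def make_partial_labels(true_labels, num_classes, flip_prob, rng):
    """Build a boolean candidate matrix of shape (n, num_classes).

    Assumption for illustration: every incorrect label joins the candidate
    set independently with probability flip_prob. The true label is always
    a candidate, matching the PLL definition in the abstract.
    """
    n = len(true_labels)
    candidates = rng.random((n, num_classes)) < flip_prob
    candidates[np.arange(n), true_labels] = True
    return candidates


def candidate_hit_rate(predictions, candidates):
    """Fraction of predicted labels that fall inside the candidate set.

    Uses only the partial labels, so it can be computed on a validation
    set whose true labels are unavailable.
    """
    rows = np.arange(len(predictions))
    return float(candidates[rows, predictions].mean())


rng = np.random.default_rng(0)
true_labels = np.array([0, 2, 1, 2])
candidates = make_partial_labels(true_labels, num_classes=3, flip_prob=0.3, rng=rng)

# Predicting the true labels always lands inside the candidate sets.
print(candidate_hit_rate(true_labels, candidates))
```

Because a wrong prediction can still fall inside the candidate set, a score like this can only overestimate true accuracy, which is one reason model selection under partial labels is non-trivial.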

Peer Reviews

Decision · ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The paper is well-written and accessible.
- Model selection in PLL is a relevant, underexplored topic.
- The proposed dataset in a realistic PLL setting adds significant value to the PLL literature.
- Despite the model selection criteria's simplicity, the authors establish a solid theoretical link to expected accuracy and demonstrate its benefits.
- The experiments are extensive, covering many PLL algorithms from top venues.

Weaknesses

While I think that this paper is really good, it would be nice if the authors could clarify the following weaknesses:
- **Model selection criteria:** While the criteria are intuitive and theoretically analyzed, presenting Oracle Accuracy (OA) as a contribution is problematic. As also noted by the authors, OA is standard in the literature and, thus, it should be presented as a baseline, not as a novel contribution. IMO the contribution statement in the introduction needs to be slightly adapted.
- **Use

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- The authors make a significant contribution by being the first to systematically investigate model selection problems in partial-label learning (PLL), addressing a gap in the existing literature. The paper presents a comprehensive PLL benchmark that includes 27 algorithms and 11 real-world datasets.
- A notable strength of the paper is the introduction of PLCIFAR10, a new benchmark dataset for PLL featuring human-annotated partial labels. This dataset provides an effective and realistic testbe

Weaknesses

N/A

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The paper points out a critical issue in PLL research: many recent studies claim state-of-the-art results while benchmarking against prior methods under an unfair setting. This can mislead the research direction, since those prior methods can even outperform many recent PLL approaches. In the standard setting of PLL, the ground truth labels are not available in validation sets. Hence, it is difficult to tune the hyper-parameters of t

Weaknesses

**Confusing terminologies**: *model selection* vs *hyper-parameter tuning*. In the paper, the authors argue that the mismatch of the validation setting results in bad model selection. In fact, as far as I understand, what they mean is hyper-parameter tuning, not model selection. Hyper-parameter tuning is indeed a subset of model selection, but not the other way around. Model selection is an established term in machine learning that also covers things like variable selection and actual model choice (funct

Taxonomy

Topics: Text and Document Classification Technologies · Rough Sets and Fuzzy Logic · Web Applications and Data Management