A data-centric approach for improving ambiguous labels with combined semi-supervised classification and clustering
Lars Schmarje, Monty Santarossa, Simon-Martin Schröder, Claudius Zelenka, Rainer Kiko, Jenny Stracke, Nina Volkmann, Reinhard Koch

TL;DR
This paper introduces DC3, a data-centric method combining semi-supervised classification and clustering to better handle ambiguous labels in datasets, improving label quality and model performance.
Contribution
The paper presents DC3, a novel approach that estimates ambiguity and applies classification or clustering, enhancing label refinement and compatibility with existing SSL algorithms.
Findings
7.6% improvement in F1-Score across datasets
7.9% reduction in inner cluster distance
Beneficial for manual label refinement
Abstract
Consistently high data quality is essential for the development of novel loss functions and architectures in the field of deep learning. The existence of such data and labels is usually presumed, while acquiring high-quality datasets is still a major issue in many cases. In real-world datasets we often encounter ambiguous labels due to subjective annotations by annotators. In our data-centric approach, we propose a method to relabel such ambiguous labels instead of implementing the handling of this issue in a neural network. A hard classification is by definition not enough to capture the real-world ambiguity of the data. Therefore, we propose our method "Data-Centric Classification & Clustering (DC3)" which combines semi-supervised classification and clustering. It automatically estimates the ambiguity of an image and performs a classification or clustering depending on that ambiguity.…
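The abstract's core idea, choosing between a classification and a clustering output depending on an estimated ambiguity, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assign_labels`, the fixed `threshold`, and the assumption that each image comes with a class distribution, a cluster distribution, and a scalar ambiguity score are all hypothetical and introduced only for illustration.

```python
from typing import List, Sequence, Tuple


def argmax(values: Sequence[float]) -> int:
    """Return the index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def assign_labels(
    class_probs: Sequence[Sequence[float]],
    cluster_probs: Sequence[Sequence[float]],
    ambiguity: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[str, int]]:
    """Pick a class label for unambiguous images and a cluster id otherwise.

    Each input has one entry per image. An image whose ambiguity score is
    at or above `threshold` is treated as ambiguous and gets a cluster id;
    every other image gets its most probable class.
    """
    if not (len(class_probs) == len(cluster_probs) == len(ambiguity)):
        raise ValueError("all inputs must describe the same number of images")

    assignments = []
    for p_class, p_cluster, score in zip(class_probs, cluster_probs, ambiguity):
        if score >= threshold:
            assignments.append(("cluster", argmax(p_cluster)))
        else:
            assignments.append(("class", argmax(p_class)))
    return assignments


# Two images: the first is confidently class 0, the second is ambiguous.
example = assign_labels(
    class_probs=[[0.9, 0.1], [0.5, 0.5]],
    cluster_probs=[[0.2, 0.3, 0.5], [0.1, 0.8, 0.1]],
    ambiguity=[0.1, 0.9],
)
```

In this sketch the ambiguity scores are given as inputs, whereas the abstract says DC3 estimates the ambiguity automatically; how that estimate is obtained is not described in the text above.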
Taxonomy
Topics: Digital Imaging for Blood Diseases · Machine Learning and Data Classification · Anomaly Detection Techniques and Applications
