Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods
Jiamian Hu, Yuanyuan Hong, Yihua Chen, He Wang, Moriaki Yasuhara

TL;DR
The Noisy Ostracods dataset provides a challenging real-world benchmark for evaluating robust machine learning methods on fine-grained, imbalanced classification tasks with diverse noise types, highlighting current approaches' limitations.
Contribution
This paper introduces the Noisy Ostracods dataset, a novel, complex dataset for genus and species classification with diverse real-world noise, and evaluates existing robust learning techniques on it.
Findings
Current robust methods show limited improvements on the dataset.
Noise detection methods underperform compared to simple ensembling.
The dataset reveals significant challenges for existing noise-robust algorithms.
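The summary does not say how the "simple ensembling" baseline was built, so the following is only an illustrative sketch of one common form: averaging the class probabilities of several models and flagging samples whose given label disagrees with the averaged prediction. The function names (`ensemble_predict`, `flag_suspects`) and the toy probabilities are hypothetical and not taken from the paper.

```python
def ensemble_predict(prob_sets):
    """Average per-class probabilities across models and take the argmax.

    prob_sets[m][i][c] is the probability that model m assigns to class c
    for sample i. This layout is an assumption made for the sketch.
    """
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    preds = []
    for i in range(n_samples):
        avg = [
            sum(prob_sets[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        preds.append(max(range(n_classes), key=avg.__getitem__))
    return preds


def flag_suspects(preds, given_labels):
    """Return indices where the ensemble disagrees with the given label."""
    return [i for i, (p, y) in enumerate(zip(preds, given_labels)) if p != y]


# Toy example: two hypothetical models, three samples, two classes.
model_a = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
model_b = [[0.7, 0.3], [0.7, 0.3], [0.1, 0.9]]
preds = ensemble_predict([model_a, model_b])
suspects = flag_suspects(preds, given_labels=[0, 1, 1])
print(preds, suspects)
```

In this toy run the two models disagree on the second sample, and the averaged prediction contradicts its given label, so that sample would be queued for expert review rather than relabeled automatically.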
Abstract
We present the Noisy Ostracods, a noisy dataset for genus and species classification of crustacean ostracods with specialists' annotations. Of the 71466 specimens collected, 5.58% are estimated to be noisy (possibly problematic) at the genus level. The dataset was created to address a real-world challenge: creating a clean fine-grained taxonomy dataset. The Noisy Ostracods dataset has diverse noise from multiple sources. First, the noise is open-set, including new classes discovered during curation that were not part of the original annotation. Second, the dataset has pseudo-classes, where annotators misclassified samples that should belong to an existing class into a new pseudo-class. The dataset is also highly imbalanced, with an imbalance factor of 22429. This presents a unique challenge for robust machine learning methods, as existing approaches have not been…
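To make the two headline numbers concrete, here is a minimal sketch. It assumes the imbalance factor follows the usual convention of dividing the size of the largest class by the size of the smallest; the abstract as excerpted here does not spell out the definition. The helper names and the toy label list are hypothetical.

```python
from collections import Counter


def imbalance_factor(labels):
    """Ratio of the largest class size to the smallest class size.

    Assumed definition (largest / smallest); not confirmed by the abstract.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def estimated_noisy_count(total, noise_rate):
    """Approximate number of noisy samples implied by a noise rate."""
    return round(total * noise_rate)


# Toy labels, not the real dataset: class sizes 8, 4 and 2.
toy_labels = ["A"] * 8 + ["B"] * 4 + ["C"] * 2
print(imbalance_factor(toy_labels))

# Figures quoted in the abstract: 71466 specimens, 5.58% genus-level noise.
print(estimated_noisy_count(71466, 0.0558))
```

Under this definition, an imbalance factor of 22429 would mean the most frequent class has roughly 22429 times as many specimens as the rarest one, and the quoted noise rate corresponds to about 3988 possibly problematic specimens at the genus level.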
Taxonomy
Topics: Statistical and Computational Modeling
