NCT-CRC-HE: Not All Histopathological Datasets Are Equally Useful
Andrey Ignatov, Grigory Malivenko

TL;DR
This paper critically examines the NCT-CRC-HE dataset for colorectal cancer histopathology, revealing biases and artifacts that inflate model performance. It demonstrates that even very simple models reach unexpectedly high accuracy, which calls the dataset's reliability into question.
Contribution
The study identifies dataset biases and artifacts affecting histopathological image analysis and shows that even very simple models reach nontrivial accuracy on this dataset, highlighting issues in current evaluation practices.
Findings
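The dataset exhibits inappropriate color normalization, severe JPEG artifacts that are inconsistent between classes, and corrupted tissue samples caused by incorrect dynamic range handling.
The simplest models achieve over 50% accuracy, while complex models exceed 97.7%.
These biases may cause reported model performance to be overestimated.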
Abstract
Numerous deep learning-based solutions have been proposed for histopathological image analysis over the past years. While they usually demonstrate exceptionally high accuracy, one key question is whether their precision might be affected by low-level image properties not related to histopathology but caused by microscopy image handling and pre-processing. In this paper, we analyze a popular NCT-CRC-HE-100K colorectal cancer dataset used in numerous prior works and show that both this dataset and the obtained results may be affected by data-specific biases. The most prominent revealed dataset issues are inappropriate color normalization, severe JPEG artifacts inconsistent between different classes, and completely corrupted tissue samples resulting from incorrect image dynamic range handling. We show that even the simplest model using only 3 features per image (red, green and blue color…
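To make the three-feature idea concrete, here is a minimal sketch of a classifier that sees only three numbers per image. It rests on several assumptions that the truncated abstract does not confirm: the three features are taken to be the per-channel mean intensities, the classifier is a nearest-centroid rule, and the images are synthetic tinted noise rather than NCT-CRC-HE patches. The helper names (`mean_rgb_features`, `fit_centroids`, `predict`, `make_synthetic`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def mean_rgb_features(images):
    """Reduce a batch of shape (N, H, W, 3) to (N, 3) per-channel means.

    Assumption: "3 features per image" is read here as mean R, G, B.
    """
    return np.asarray(images, dtype=np.float64).mean(axis=(1, 2))


def fit_centroids(features, labels):
    """Compute one mean feature vector (centroid) per class."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(features, classes, centroids):
    """Assign each sample to the class with the nearest centroid."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def make_synthetic(rng, tints, n_per_class, size=8, noise=20.0):
    """Build toy images: a class-specific RGB tint plus Gaussian pixel noise.

    This stands in for real data only to keep the sketch self-contained.
    """
    images, labels = [], []
    for label, tint in enumerate(tints):
        pixels = tint + rng.normal(0.0, noise, size=(n_per_class, size, size, 3))
        images.append(np.clip(pixels, 0, 255))
        labels.append(np.full(n_per_class, label))
    return np.concatenate(images), np.concatenate(labels)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical class tints; real stain colors would differ.
    tints = np.array([[200.0, 120.0, 180.0],
                      [230.0, 200.0, 220.0],
                      [120.0, 60.0, 140.0]])
    train_x, train_y = make_synthetic(rng, tints, n_per_class=50)
    test_x, test_y = make_synthetic(rng, tints, n_per_class=50)

    classes, centroids = fit_centroids(mean_rgb_features(train_x), train_y)
    predicted = predict(mean_rgb_features(test_x), classes, centroids)
    print("accuracy:", (predicted == test_y).mean())
```

The point of the sketch is that such a model has no access to tissue morphology at all. If class labels correlate with average color, as the paper argues can happen through color normalization and preprocessing, a model like this can still score well above chance.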
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging
