HISTAI: An Open-Source, Large-Scale Whole Slide Image Dataset for Computational Pathology
Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova

TL;DR
HISTAI is a comprehensive, large-scale open-access dataset of over 60,000 whole slide images with extensive clinical metadata, designed to advance AI research in computational pathology.
Contribution
The paper introduces HISTAI, a novel large-scale, multimodal WSI dataset with detailed annotations and metadata to support robust AI model development in pathology.
Findings
Provides a diverse, annotated dataset for AI training
Enhances reproducibility and clinical relevance in computational pathology
Fills gaps in existing public WSI datasets: limited scale, tissue diversity, and clinical metadata
Abstract
Recent advancements in Digital Pathology (DP), particularly through artificial intelligence and Foundation Models, have underscored the importance of large-scale, diverse, and richly annotated datasets. Despite their critical role, publicly available Whole Slide Image (WSI) datasets often lack sufficient scale, tissue diversity, and comprehensive clinical metadata, limiting the robustness and generalizability of AI models. In response, we introduce the HISTAI dataset, a large, multimodal, open-access WSI collection comprising over 60,000 slides from various tissue types. Each case in the HISTAI dataset is accompanied by extensive clinical metadata, including diagnosis, demographic information, detailed pathological annotations, and standardized diagnostic coding. The dataset aims to fill gaps identified in existing resources, promoting innovation, reproducibility, and the development of…
Taxonomy
Topics: AI in cancer detection · Digital Imaging for Blood Diseases · Radiomics and Machine Learning in Medical Imaging
