Automated Assessment of Ki-67 Labeling Index Using Cell-Level Detection and Classification in Whole-Slide Images
Masayuki Tsuneki, Meng Li, Fahdi Kanavati

TL;DR
This paper presents an AI system that can accurately assess tumor cell proliferation using Ki-67 markers, matching the performance of human pathologists.
Contribution
A novel AI-based system for cell-level Ki-67 labeling index assessment with high accuracy and reproducibility.
Findings
The AI system achieved 98% AUC in classifying Ki-67-positive and -negative nuclei.
The AI's Ki-67 LI assessment showed concordance with pathologists comparable to human inter-observer agreement.
Abstract
Background: The Ki-67 labeling index (LI) is a widely used marker of tumour proliferation, yet its manual assessment is time-consuming and subject to substantial inter-observer variability. Automated methods may improve reproducibility, but their clinical relevance depends on achieving performance comparable to expert pathologists. Method: We evaluated an artificial intelligence (AI)-based, cell-level system for automated Ki-67 LI assessment that detects and classifies individual tumour cell nuclei as Ki-67-positive or -negative. After nuclear detection using a pre-existing cell detection model, a lightweight convolutional neural network classifier operating on nucleus-centred patches was trained, and then applied to cases independently assessed by three pathologists. Agreement between AI-derived and human Ki-67 LI values was compared directly with inter-pathologist agreement across a…
1. Introduction
Ki-67 is a nuclear protein expressed in proliferating cells and is widely used as an immunohistochemical marker of cellular proliferation in routine surgical pathology. The Ki-67 labeling index (LI), defined as the proportion of Ki-67-positive tumour nuclei relative to the total number of tumour nuclei, is routinely reported in multiple tumour types and contributes to tumour grading, prognostic stratification, and therapeutic decision-making. For example, Ki-67 evaluation provides important prognostic and predictive information and has become an integral component of routine histopathological reporting, particularly in hormone receptor-positive, HER2-negative breast carcinomas [1]. Accurate and reproducible assessment of Ki-67 LI is therefore of substantial clinical importance, as even modest differences in the reported value may influence risk stratification and treatment selection.
In current clinical practice, Ki-67 LI assessment is performed manually by pathologists, typically through visual estimation or manual counting in selected regions of interest [2,3]. Although this approach is common practice, it is inherently subjective and time-consuming [4,5]. Variability in hotspot selection, counting strategy, and interpretation of staining intensity can lead to differences in reported Ki-67 LI values, particularly in cases with intermediate or borderline proliferation indices where clinical decisions may be affected [3,6]. In addition, inter-laboratory differences in staining protocols, antibody clones, and slide preparation further contribute to measurement variability, complicating standardization efforts across institutions.
With the increasing adoption of whole-slide imaging in routine pathology, there is growing interest in automated methods that may improve the reproducibility and efficiency of Ki-67 assessment. Automated analysis offers the potential to process large tissue areas objectively and consistently, thereby reducing observer-dependent bias and workload. From a clinical perspective, however, the primary question is not whether an automated system introduces novel methodology, but whether it can perform at a level comparable to expert pathologists and operate within the range of established inter-observer variability. Any such system must therefore be evaluated relative to human performance rather than against an abstract technical benchmark, and its output should be interpretable within the framework of existing pathological practice.
Recent advances in computational pathology have enabled reliable detection and classification of individual cell nuclei on histopathological images using deep learning-based methods [7,8,9,10,11,12,13,14]. Cell-level analysis is particularly well suited to Ki-67 assessment [15], as the clinical definition of the labeling index is inherently based on counting individual positive and negative nuclei. By explicitly detecting and classifying individual cells, automated approaches have the potential to closely mirror the pathological assessment process while offering improved consistency, scalability, and traceability of results.
In this study, we evaluate an AI-based, cell-level approach for Ki-67 LI assessment that combines automated nuclear detection with classification of individual nuclei as Ki-67-positive or Ki-67-negative. The system was trained using a dataset of manually annotated cells and applied to a set of cases independently assessed by three pathologists.
Our primary objective was to determine whether the AI-derived Ki-67 labeling index demonstrates agreement with expert pathologists comparable to inter-pathologist agreement. We do not propose a new AI architecture; instead, we treat the AI system as an off-the-shelf tool and focus on a clinically grounded evaluation framework: a traceable, cell-level output (per-nucleus positivity) and benchmarking of AI-derived Ki-67 LI against a multi-pathologist reference with performance interpreted in relation to inter-observer agreement. This design directly addresses the practical question of whether an existing AI system performs within the variability expected in routine practice. By directly comparing AI performance to that of multiple pathologists and situating the results within the context of human inter-observer variability, this study aims to assess the potential clinical utility of automated Ki-67 LI assessment as a reproducible decision-support tool in routine histopathological practice. Such a system may contribute to improved standardization of Ki-67 evaluation [16,17,18], support quality assurance, and ultimately enhance the reliability of proliferation assessment in precision oncology.
2. Methods
2.1. Overview of the Analysis Pipeline
The objective of this study was to develop and evaluate an AI-based method for quantitative assessment of the Ki-67 LI on digitised histopathology images. To do so, an annotated dataset was first constructed to train the AI-based method.
The AI-based method consists of two sequential steps: (1) automatic detection of individual cell nuclei using a deep learning-based instance segmentation algorithm, and (2) classification of detected nuclei as Ki-67-positive or Ki-67-negative, followed by computation of the Ki-67 LI. An overview of the pipeline is shown in Figure 1.
2.2. Clinical Cases
For the present retrospective study, a total of 320 human biopsy and surgical cases with Ki-67 immunohistochemically stained histopathological specimens were collected from the surgical pathology files of the International University of Health and Welfare, Mita Hospital (Tokyo), after histopathological review of the specimens by surgical pathologists. The 320 specimens comprised 169 breast cancers, 48 lung cancers, 37 stomach cancers, 35 colon cancers, and 31 uterine cancers. The Ki-67 antibody clone was MIB-1, and immunostaining was performed using Dako automated staining. The tumour types across the cases were as follows: breast cancers: invasive ductal carcinoma; lung cancers: adenocarcinoma and squamous cell carcinoma; stomach cancers: adenocarcinoma; colon cancers: adenocarcinoma; and uterine cancers: endometrial carcinoma. Figure 2 shows representative samples from breast invasive ductal carcinoma.
The experimental protocol was approved by the ethical board of the International University of Health and Welfare (No. 19-Im-007). All research activities complied with relevant ethical regulations and were performed in accordance with the relevant guidelines and regulations of the International University of Health and Welfare, Mita Hospital (Tokyo). Informed consent to use histopathological samples and pathological diagnostic reports for research purposes had been obtained from all patients prior to the surgical procedures, and the opportunity to refuse participation in research was guaranteed through an opt-out procedure.
2.3. Whole-Slide Image Preparation
All analyses were performed on digitised histopathology images acquired from immunohistochemically stained Ki-67 slides. Whole-slide images (WSIs) were scanned at high resolution using a Leica Aperio AT2 scanner and stored in a pyramidal format. All WSIs were scanned at a magnification of ×20.
2.4. Pathologist Reference Annotations
Regions of interest (ROIs) containing representative tumour tissue were selected for annotation by an experienced pathologist, who ensured that no ROI included artefacts, non-tumour tissue, or large areas of background. Each ROI image patch measured 512 × 512 pixels.
The annotated dataset was constructed by expert manual annotation of individual nuclei on Ki-67-stained ROI images. Using in-house annotation software, each nucleus was annotated as either Ki-67-positive or Ki-67-negative based on nuclear staining intensity and morphology, following standard pathological criteria (an example is shown in Figure 3). All stained nuclei were included, regardless of staining intensity; specifically, all tumour nuclei showing any degree of nuclear Ki-67 staining were labelled as positive (i.e., no intensity thresholding was applied). Annotations were reviewed by a senior pathologist to ensure consistency and to minimise ambiguous cases. Across the 320 cases, a total of 142K Ki-67-positive and 340K Ki-67-negative cells were annotated.
For evaluation of inter-observer variability and comparison with the AI-based method, a total of 20 unannotated ROI images were selected, each ROI from a different WSI. The Ki-67 LI estimates were then independently provided by three pathologists, each with more than ten years of experience. Each pathologist assessed the same set of image patches according to their individual routine diagnostic practice. An expert consensus value was defined as the median Ki-67 LI among the three pathologists and was used as the primary reference for comparison with the AI model.
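The consensus definition above (the median of the three pathologists' estimates for each ROI) can be sketched in a few lines. This is only an illustration: the function name `consensus_li` and the example values are hypothetical and are not study data.

```python
import numpy as np

def consensus_li(ratings):
    """Median Ki-67 LI across raters for each ROI.

    ratings: array-like of shape (n_rois, n_raters), values in percent.
    """
    ratings = np.asarray(ratings, dtype=float)
    return np.median(ratings, axis=1)

# Hypothetical values for three ROIs scored by three pathologists.
example = [[12.0, 15.0, 10.0],
           [35.0, 30.0, 42.0],
           [5.0, 5.0, 8.0]]
consensus = consensus_li(example)
```

With three raters the median is simply the middle value, so a single outlying estimate does not shift the reference.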
2.5. Cell Detection
Detection of individual cell nuclei was performed using a fine-tuned StarDist algorithm [19,20], a deep learning-based instance segmentation method designed for star-convex object detection in microscopy images. StarDist models each nucleus as a star-convex polygon defined by radial distances from a central point, enabling accurate separation of densely packed and partially overlapping nuclei. The model was fine-tuned from the base model (Versatile H&E nuclei) using a subset of the training images following the same training protocol as described in [19,20]. We used the publicly available code [21] without any modifications.
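To make the star-convex representation concrete, the sketch below converts a centre point and a set of radial distances into polygon vertices. It is a geometric illustration only and does not call the StarDist library; the evenly spaced ray angles, their ordering, and the helper names are assumptions that may differ from the actual implementation.

```python
import numpy as np

def star_convex_polygon(center_xy, radial_distances):
    """Vertices (n_rays, 2) of a star-convex polygon.

    Assumes rays are evenly spaced over 360 degrees, starting along +x.
    """
    cx, cy = center_xy
    d = np.asarray(radial_distances, dtype=float)
    angles = 2.0 * np.pi * np.arange(len(d)) / len(d)
    xs = cx + d * np.cos(angles)
    ys = cy + d * np.sin(angles)
    return np.stack([xs, ys], axis=1)

def polygon_area(vertices):
    """Polygon area via the shoelace formula."""
    x, y = vertices[:, 0], vertices[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
```

Because every vertex is reached along a straight ray from the centre, two touching nuclei are described by two separate polygons rather than one merged blob, which is what makes this representation suited to crowded fields.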
The fine-tuned StarDist model was applied to the Ki-67-stained image patches to identify and segment individual nuclei. The output of this step consisted of polygonal nuclear masks and corresponding centroid coordinates for each detected cell. Only nuclei detected within tumour regions were retained for subsequent analysis.
2.6. Ki-67 Cell Classification
Following nuclear detection, each segmented nucleus was classified as Ki-67-positive or Ki-67-negative from a 32 × 32-pixel patch. For this purpose, a supervised classification model consisting of a small convolutional neural network (CNN) was trained using a manually curated dataset.
The dataset was split at the case level into approximately 45% for training, 5% for validation, and 50% for testing. The reason for this split is as follows: classifying cells as Ki-67-positive or Ki-67-negative is an extremely simple task, as the label is highly correlated with colour. In our initial experiments, even a simple colour-threshold classifier achieved 91% AUC, and adding further training images did not improve performance in this particular case.
The classifier was implemented as a sequential CNN. Input patches were first rescaled to the [0, 1] range by dividing pixel intensities by 255. During training, data augmentation was applied to improve robustness to staining and orientation variation, including random horizontal/vertical flips, small rotations, and modest brightness/contrast jitter. The network then applied two convolutional blocks with 3 × 3 kernels (32 and 64 filters, respectively), each followed by max pooling. The resulting feature maps were flattened and passed through a 128-unit fully connected layer. A single sigmoid-activated output unit produced the probability of Ki-67 positivity (binary classification). GELU activations were used in the hidden layers. The model was trained with the Adam optimiser, binary cross-entropy loss, and accuracy as the primary metric, for 20 epochs.
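As a rough check on how small this network is, the following sketch walks through the layer shapes and parameter counts implied by the description. Several details are not stated in the text and are assumed here: three-channel RGB input, stride-1 convolutions without padding, and 2 × 2 max pooling. The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Spatial size after a stride-1 convolution without padding (assumed)."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (assumed 2 x 2)."""
    return size // pool

def describe_network(input_size=32, channels=3, filters=(32, 64), dense_units=128):
    """Return (layers, total_params) for the architecture described in the text."""
    size, in_ch, total = input_size, channels, 0
    layers = []
    for f in filters:
        size = conv_out(size)
        params = (3 * 3 * in_ch + 1) * f          # weights plus one bias per filter
        layers.append(("conv3x3", (size, size, f), params))
        total += params
        size = pool_out(size)
        layers.append(("maxpool2x2", (size, size, f), 0))
        in_ch = f
    flat = size * size * in_ch
    params = (flat + 1) * dense_units             # fully connected layer
    layers.append(("dense", (dense_units,), params))
    total += params
    params = dense_units + 1                      # single sigmoid output unit
    layers.append(("sigmoid_output", (1,), params))
    total += params
    return layers, total

layers, total_params = describe_network()
```

Under these assumptions almost all parameters sit in the fully connected layer, which is consistent with the text's description of the classifier as lightweight.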
To address class imbalance, class weights were computed from the training set based on the inverse of class frequencies and supplied during optimisation. Model selection was performed using a checkpoint callback that saved the weights of the epoch with the lowest validation loss.
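One common way to derive inverse-frequency class weights is sketched below. The exact formula used in the study is not given, so the normalisation (total count divided by the number of classes times the class count) is an assumption, and the counts in the example are hypothetical.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count).

    class_counts: dict mapping class label to number of training samples.
    """
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count)
            for label, count in class_counts.items()}

# Hypothetical counts: label 0 = Ki-67-negative, label 1 = Ki-67-positive.
weights = inverse_frequency_weights({0: 340, 1: 142})
```

A dictionary of this form can be passed as the `class_weight` argument of Keras `Model.fit`, so that errors on the rarer positive class contribute proportionally more to the loss.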
Image patches centred on detected nuclei were extracted and used as input to the classifier. The trained model was then applied to all detected nuclei to generate per-cell Ki-67 positivity predictions.
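A minimal sketch of nucleus-centred patch extraction follows. The 32 × 32 patch size comes from the text; the (row, col) centroid convention, the reflect padding used for nuclei near the image border, and the function name are assumptions made for illustration.

```python
import numpy as np

def extract_patch(image, centroid_rc, size=32):
    """Crop a size x size patch centred on (row, col).

    image: array of shape (H, W, C). Borders are reflect-padded (assumed)
    so that nuclei close to the edge still yield a full-size patch.
    """
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    r = int(round(centroid_rc[0])) + half   # shift into padded coordinates
    c = int(round(centroid_rc[1])) + half
    return padded[r - half:r + half, c - half:c + half]
```

Each extracted patch would then be divided by 255 before being passed to the classifier, matching the rescaling step described for the network input.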
Figure 4 shows representative final classification output on a selected area of a WSI.
2.7. Computation of the Ki-67 Labeling Index (LI)
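The Ki-67 LI was computed as the ratio of Ki-67-positive nuclei to the total number of detected nuclei within a given region or case:
$$\text{Ki-67 LI} = \frac{N_{\text{positive}}}{N_{\text{positive}} + N_{\text{negative}}}$$
where $N_{\text{positive}}$ and $N_{\text{negative}}$ denote the numbers of detected nuclei classified as Ki-67-positive and Ki-67-negative, respectively.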
For reporting purposes, the Ki-67 LI was expressed as a percentage. This definition is consistent with standard clinical and pathological practice.
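The ratio described above can be sketched directly from the per-nucleus classifier outputs. The decision threshold of 0.5 on the sigmoid probability is an assumption, as the text does not state the cut-off used, and the function name is hypothetical.

```python
def ki67_labeling_index(probabilities, threshold=0.5):
    """Ki-67 LI in percent from per-nucleus positivity probabilities.

    A nucleus counts as positive when its probability is >= threshold
    (threshold value assumed, not taken from the study).
    """
    probs = list(probabilities)
    if not probs:
        raise ValueError("no nuclei detected; the labeling index is undefined")
    positive = sum(1 for p in probs if p >= threshold)
    return 100.0 * positive / len(probs)
```

For example, a region with two nuclei above the threshold and two below it yields an LI of 50%.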
2.8. Statistical Analysis
Correlation between AI predictions and pathologist estimates was quantified using Pearson’s correlation coefficient and Spearman’s rank correlation coefficient. Prediction error was assessed using the mean absolute error (MAE).
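The three quantities named above can be computed with NumPy alone, as in the sketch below. The function names are hypothetical, and Spearman's coefficient is obtained as the Pearson correlation of average ranks, which handles tied values.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def average_ranks(x):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def spearman_rho(x, y):
    """Spearman rank correlation as Pearson correlation of average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def mean_absolute_error(x, y):
    """Mean absolute difference, in the units of the inputs."""
    return float(np.mean(np.abs(np.asarray(x, float) - np.asarray(y, float))))
```

When both inputs are LI values in percent, the MAE is expressed in percentage points, which is the unit used in the results.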
Bias and limits of agreement between AI predictions and expert consensus were evaluated using Bland–Altman analysis. For threshold-based analyses, clinically relevant Ki-67 cutoffs were applied and categorical agreement was assessed using Cohen’s κ statistic.
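A Bland–Altman summary reduces to the mean of the paired differences and an interval around it. The sketch below uses the conventional 1.96 standard deviations for the 95% limits of agreement; the sign convention (AI minus reference) and the function name are assumptions.

```python
import numpy as np

def bland_altman(ai, reference):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of (ai - reference); the limits are bias +/- 1.96 * SD,
    with SD the sample standard deviation of the differences.
    """
    diff = np.asarray(ai, dtype=float) - np.asarray(reference, dtype=float)
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

A bias close to zero with narrow limits indicates that the two methods can be used interchangeably within that margin.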
For the threshold-based analyses, Ki-67 LI values were grouped into three ordinal categories using commonly applied clinical cut-offs (<10%, 10–20%, and >20%). These thresholds were chosen to reflect typical low/intermediate/high proliferation strata similar to ranges used in some prior studies [22,23]. The optimal Ki-67 thresholds are not universal and may vary by tumour type, clinical context, and institutional practice; therefore, the cut-off analysis in this study is intended as a pragmatic sensitivity analysis rather than as a tumour-specific recommendation.
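The categorical analysis can be sketched as a binning step followed by Cohen's κ. Two points are assumptions rather than details from the text: values of exactly 10% and 20% are placed in the intermediate category, and κ is computed in its unweighted form (a weighted κ is also common for ordinal categories).

```python
import numpy as np

def li_category(li_percent):
    """Map an LI value to 0 (<10%), 1 (10-20%), or 2 (>20%)."""
    if li_percent < 10:
        return 0
    if li_percent <= 20:
        return 1
    return 2

def cohen_kappa(labels_a, labels_b, n_categories=3):
    """Unweighted Cohen's kappa for two raters' category labels.

    Undefined (division by zero) if both raters use a single identical category.
    """
    cm = np.zeros((n_categories, n_categories))
    for a, b in zip(labels_a, labels_b):
        cm[a, b] += 1
    n = cm.sum()
    p_observed = np.trace(cm) / n
    p_expected = (cm.sum(axis=1) @ cm.sum(axis=0)) / n ** 2
    return float((p_observed - p_expected) / (1.0 - p_expected))

# Hypothetical LI values (percent), not study data.
ai_cat = [li_category(v) for v in [5, 15, 25, 12]]
ref_cat = [li_category(v) for v in [8, 18, 30, 25]]
kappa = cohen_kappa(ai_cat, ref_cat)
```

κ corrects the raw proportion of matching categories for the agreement expected by chance given each rater's category frequencies.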
All statistical analyses were performed using Python 3.11-based scientific computing libraries. A two-sided significance level of 0.05 was used where applicable. The CNN classifier was implemented using Keras with TensorFlow backend [24].
3. Results
3.1. Classification
The cell classification model was evaluated on a test set consisting of 71K positive and 170K negative nuclei. These nuclei were sampled from held-out cases across the included tumour types (breast, lung, stomach, colon, and uterine cancers) to reflect multi-tumour variability. The receiver operating characteristic (ROC) curve is plotted in Figure 5.
3.2. Inter-Pathologist Agreement in Ki-67 Assessment
The primary aim of this analysis was to quantify inter-observer variability in Ki-67 LI assessment and to evaluate whether an AI-based approach achieves agreement comparable to that observed among expert pathologists. Agreement among the three pathologists was assessed using the intraclass correlation coefficient (ICC). The inter-pathologist ICC indicated good agreement, with an ICC of 0.93 (95% CI: 0.82–0.97).
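For readers unfamiliar with the ICC, the sketch below computes one common form, the two-way random-effects, absolute-agreement, single-rater coefficient, often written ICC(2,1). The text does not state which ICC form was used, so this choice is an assumption, and the confidence interval is not computed here.

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1) from a (n_subjects, n_raters) matrix of ratings.

    Uses the mean squares of a two-way ANOVA without replication.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return float((msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n))
```

In this setting the rows would be the 20 ROIs and the columns the three pathologists; adding the AI as a fourth column corresponds to treating it as an additional rater.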
The distributions of Ki-67 LI values across pathologists were broadly similar, although small differences in dispersion were observed (Figure 6).
3.3. Agreement Between AI Predictions and Pathologists
Next, we evaluated the agreement between the AI model and individual pathologists, as well as with the expert consensus defined as the median Ki-67 LI among the three pathologists. Correlation analysis demonstrated a strong association between AI predictions and expert consensus for both Pearson’s r and Spearman’s ρ, with a mean absolute error of 3.3 percentage points (Table 1).
When the AI model was included as an additional rater alongside the three pathologists, the overall agreement remained comparable to human-only agreement, with an ICC of 0.95 (95% CI: 0.89–0.98). This indicates that the variability between AI and pathologists falls within the range of inter-observer variability observed among experts.
Scatter plots comparing AI predictions with individual pathologists and expert consensus demonstrated close alignment along the identity line across the full dynamic range of Ki-67 values, with no evident systematic deviation (Figure 7).
3.4. Bias and Limits of Agreement Analysis
Bland–Altman analysis was performed to further characterise systematic differences between AI predictions and expert consensus. The mean bias, expressed in percentage points, indicated minimal systematic over- or under-estimation by the AI model (Figure 8).
No proportional bias was observed, and disagreement did not increase with higher Ki-67 LI values.
3.5. Threshold-Based Categorical Agreement
Using clinically relevant Ki-67 thresholds (<10%, 10–20%, >20%), categorical agreement between AI predictions and expert consensus was assessed. The resulting Cohen’s κ was 0.87, corresponding to substantial agreement. Table 2 shows the confusion matrix.
3.6. Summary of Findings
Taken together, these results demonstrate that the AI-based Ki-67 LI assessment achieves agreement with expert pathologists that is comparable to inter-pathologist agreement. The AI model exhibited low bias, strong correlation with expert consensus, and reduced variability relative to individual raters, supporting its potential utility as a reproducible decision-support tool within established clinical variability.
4. Discussion
In this study, we investigated whether an AI-based, cell-level approach to Ki-67 LI assessment can achieve performance comparable to that of expert pathologists. By explicitly positioning AI performance within the context of inter-pathologist variability, our results address a clinically meaningful question: whether automated Ki-67 assessment can reach a level of agreement consistent with routine pathological practice.
Inter-observer agreement among pathologists was good but imperfect, reflecting the variability inherent in manual Ki-67 assessment. These findings are consistent with previous reports highlighting the limitations of manual Ki-67 evaluation despite its widespread adoption and clinical importance [4,5,25,26].
The AI-based approach demonstrated strong agreement with individual pathologists as well as with expert consensus, with agreement metrics comparable to those observed among pathologists themselves. Notably, when the AI system was included as an additional rater alongside the three pathologists, the overall agreement did not deteriorate. This indicates that the variability between the AI system and human observers lies within the range of human–human variability. From a clinical perspective, this is a crucial observation, as it suggests that the AI system performs at a level consistent with expert interpretation rather than introducing systematic or clinically meaningful divergence.
Bias analysis further showed minimal systematic over- or underestimation of Ki-67 LI by the AI model relative to expert consensus. The limits of agreement between the AI and consensus were comparable to those observed between individual pathologists, supporting the interpretation that most discrepancies reflect the intrinsic uncertainty of Ki-67 assessment rather than algorithmic error.
Beyond agreement metrics, a key advantage of the proposed cell-level framework is its interpretability. Because the Ki-67 LI is derived from explicit detection and classification of individual nuclei, the outputs of the AI system can be directly inspected and verified by pathologists. This transparency distinguishes cell-based approaches from purely region-level or black-box regression models and is likely to facilitate clinical acceptance. Users can visually confirm which nuclei were classified as positive or negative, which is particularly valuable in cases with disagreement, where reviewing individual cell classifications may clarify the source of disagreement and support informed human oversight.
Another practical strength of the proposed method is its scalability and reproducibility. Once trained and validated, the AI system can analyse large WSIs in a consistent manner, without fatigue. This capability is especially advantageous in high-volume diagnostic laboratories, multicentre clinical trials, and large retrospective studies, where manual Ki-67 quantification is both time-consuming and subject to operator variability. Moreover, automated analysis enables more comprehensive assessment of tumour areas, potentially reducing sampling bias arising from limited hotspot selection and improving the representativeness of the measured LI.
Standardisation of Ki-67 assessment remains a long-standing challenge in pathology, despite extensive international efforts [27]. While AI alone cannot address all sources of variability, such as differences in staining protocols or pre-analytical factors, it can provide a stable and reproducible computational layer once slides are digitised. In this context, AI-based quantification may function as a harmonising tool that complements existing standardisation initiatives, helping to reduce observer-dependent variability and facilitating comparison of Ki-67 results across institutions and studies.
Furthermore, access to continuous and reproducible Ki-67 measurements may enable more nuanced use of proliferation indices in both research and clinical practice. Rather than relying solely on rigid categorical thresholds, AI-derived Ki-67 values could support probabilistic risk stratification models or be integrated with other quantitative histopathological and molecular features. Such integration is consistent with the broader direction of computational pathology and data-driven precision oncology.
It is also important to emphasise that the objective of this work is not to replace expert judgement, but to augment it. By providing an objective and reproducible reference, AI-based Ki-67 assessment may help reduce intra- and inter-observer variability, support the training of less experienced pathologists, and increase confidence in difficult or borderline cases. In this sense, the system should be viewed as a tool that strengthens the reliability of pathological evaluation while preserving the central role of the pathologist in diagnostic interpretation and clinical decision-making.
Several limitations of this study should be acknowledged. First, the classification model was trained on data obtained from a single source, which may limit generalisability. Second, only three pathologists participated in this study, all with a similar number of years of experience, which does not capture the full variability in experience between pathologists. Finally, the evaluation was retrospective and conducted on a defined dataset; further prospective validation and assessment across multiple institutions and tumour types will be necessary before routine clinical deployment can be considered.
From a clinical workflow perspective, the primary potential utility of this approach lies in its role as a decision-support tool rather than as a replacement for pathologists. Automated and reproducible Ki-67 quantification may help reduce variability, improve consistency across observers and institutions, and increase efficiency in high-volume settings. In particular, such a system may be especially valuable for cases with borderline Ki-67 values or in environments where access to subspecialty expertise is limited.
5. Conclusions
In conclusion, this study demonstrates that an AI-based, cell-level approach for Ki-67 LI assessment achieves agreement with expert pathologists comparable to inter-pathologist agreement. The AI system exhibited low bias, strong correlation with expert consensus, and variability consistent with routine clinical practice. These findings support the potential clinical utility of automated Ki-67 LI assessment as a reproducible and interpretable decision-support tool in histopathological workflows. Further prospective and multi-institutional studies will be required to confirm generalisability and to define the optimal role of such systems in clinical practice.
References
- 1. Ma Q., Liu Y.B., She T., Liu X.L. The role of Ki-67 in HR+/HER2- breast cancer: A real-world study of 956 patients. Breast Cancer Targets Ther. 2024, 16, 117–126. doi:10.2147/BCTT.S451617. PMID: 38476641. PMCID: PMC10929654.
- 2. Dowsett M., Nielsen T.O., A’Hern R., Bartlett J., Coombes R.C., Cuzick J., Ellis M., Henry N.L., Hugh J.C., Lively T. Assessment of Ki67 in breast cancer: Recommendations from the International Ki67 in Breast Cancer working group. JNCI J. Natl. Cancer Inst. 2011, 103, 1656–1664. doi:10.1093/jnci/djr393. PMID: 21960707. PMCID: PMC3216967.
- 3. Abubakar M., Orr N., Daley F., Coulson P., Ali H.R., Blows F., Benitez J., Milne R., Brenner H., Stegmaier C. Prognostic value of automated KI67 scoring in breast cancer: A centralised evaluation of 8088 patients from 10 study groups. Breast Cancer Res. 2016, 18, 104. doi:10.1186/s13058-016-0765-6. PMID: 27756439. PMCID: PMC5070183.
- 4. Shui R., Yu B., Bi R., Yang F., Yang W. An interobserver reproducibility analysis of Ki67 visual assessment in breast cancer. PLoS ONE 2015, 10, e0125131. doi:10.1371/journal.pone.0125131. PMID: 25932921. PMCID: PMC4416820.
- 5. Rimm D.L., Leung S.C., McShane L.M., Bai Y., Bane A.L., Bartlett J.M., Bayani J., Chang M.C., Dean M., Denkert C. An international multicenter study to evaluate reproducibility of automated scoring for assessment of Ki67 in breast cancer. Mod. Pathol. 2019, 32, 59–69. doi:10.1038/s41379-018-0109-4. PMID: 30143750.
- 6. Polley M.Y.C., Leung S.C., McShane L.M., Gao D., Hugh J.C., Mastropasqua M.G., Viale G., Zabaglo L.A., Penault-Llorca F., Bartlett J.M. An international Ki67 reproducibility study. J. Natl. Cancer Inst. 2013, 105, 1897–1906. doi:10.1093/jnci/djt306. PMID: 24203987. PMCID: PMC3888090.
- 7. Naylor P., Laé M., Reyal F., Walter T. Segmentation of nuclei in histopathology images by deep regression of the distance map. IEEE Trans. Med. Imaging 2018, 38, 448–459. doi:10.1109/TMI.2018.2865709. PMID: 30716022.
- 8. Kumar N., Verma R., Anand D., Zhou Y., Onder O.F., Tsougenis E., Chen H., Heng P.A., Li J., Hu Z. A multi-organ nucleus segmentation challenge. IEEE Trans. Med. Imaging 2019, 39, 1380–1391. doi:10.1109/TMI.2019.2947628. PMID: 31647422. PMCID: PMC10439521.
