Interpretable Multiple Instance Learning for Hematologic Diagnosis from Peripheral Blood Smears

Siddharth Singi; Shenghuan Sun; Zhanghan Yin; Riya Gupta; Dylan C. Webb; Khawaja H. Bilal; Deepika Dilip; Linlin Wang; Neeraj Kumar; Swaraj Nanda; Nicolas Sanchez; Jacob G. Van Cleave; Brenda Fried; Sean Paulsen; Ethan S. Yan; Ali Kamali; Argho Sarkar; Allyne Manzo; Jeeyeon Baik; Irem S Isgor; Cesar Colorado-Jimenez; Anthony Cardillo; Leonardo Boiocchi; Aijazuddin Syed; David Kim; Brie Kezlarian-Sachs; Maly Fenelus; Alexander Chan; Mariko Yabe; Samuel I McCash; Menglei Zhu; Simon Mantha; Orly Ardon; Lauren McVoy; Wenbin Xiao; Mikhail Roshal; Oscar Lin; Ahmet Dogan; Iain Carmichael; Chad Vanderbilt; Gregory M. Goldgof

PMC · DOI:10.21203/rs.3.rs-6933141/v1·October 31, 2025

Interpretable Multiple Instance Learning for Hematologic Diagnosis from Peripheral Blood Smears

Siddharth Singi, Shenghuan Sun, Zhanghan Yin, Riya Gupta, Dylan C. Webb, Khawaja H. Bilal, Deepika Dilip, Linlin Wang, Neeraj Kumar, Swaraj Nanda, Nicolas Sanchez, Jacob G. Van Cleave, Brenda Fried, Sean Paulsen, Ethan S. Yan, Ali Kamali, Argho Sarkar, Allyne Manzo, Jeeyeon Baik

PDF

Open Access

TL;DR

This paper introduces CAREMIL, an interpretable machine learning framework that improves hematologic diagnosis from blood smears by combining cell-level features with whole-slide predictions.

Contribution

CAREMIL is a novel weakly supervised attention-based MIL framework that outperforms existing methods and provides interpretable insights for hematologic diagnosis.

Findings

01

CAREMIL paired with DeepHeme achieves state-of-the-art performance in diagnosing AML, MDS, and HCL with high AUROC scores.

02

CAREMIL improves performance especially when using out-of-domain encoders like those trained on ImageNet or open-source pathology models.

03

Attention values from CAREMIL highlight diagnostically relevant cells and morphometric signatures, enabling biological interpretability.

Abstract

Accurate diagnosis of hematologic malignancies from peripheral blood smears (PBSs) requires integrating cellular morphology and composition across hundreds of white blood cells. Existing approaches primarily automate single-cell classification and do not provide whole-slide diagnostic predictions. We present a full network that utilizes a highly performative cell-based encoder (DeepHeme) for feature extraction paired with our weakly supervised framework using attention-based multiple instance learning (MIL) that we call CAREMIL (Cell AggRegation, Explainable, Multiple Instance Learning). Upon evaluating various popular image encoders and MIL architectures, the combination of DeepHeme and CAREMIL is the best performing pipeline on our disease classification task. CAREMIL proves to be a robust aggregation function that outperforms the most commonly used slide level aggregation function…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases4

HCL AML hematologic malignancies MDS

Figures5

Click any figure to enlarge with its caption.

A](#F1). Recent computational efforts have focused primarily on automating aspects of PBS interpretation, including single-cell classification and cytometric quantification ([8](#R8)), ([9](#R9)), ([10](#R10)). While these approaches improve efficiency, they fail to capture the integrative reasoning required for whole-slide diagnosis. In contrast to solid tumor pathology, where tiling approaches can leverage preserved tissue architecture ([11](#R11)), ([12](#R12)), hematologic specimens lack cohesive spatial organization([13](#R13)), rendering traditional whole slide image (WSI) analysis metho

C](#F2) the top panel shows that in a sample with acute myeloid leukemia (AML), the high-attention cells consistently included leukemic blasts characterized by fine chromatin and high nuclear-to-cytoplasmic ratios. The next panel shows a case of MDS, with high-attention cells including examples of both dysplastic neutrophils and immature granulocytes which, when circulating in peripheral blood, are characteristic of MDS. The bottom panel shows an HCL sample, where the high-attention cells highlight the characteristic abnormal “hairy cell” lymphocytes associated with the disease.

B](#F3), the continuous line delineates the performance of CAREMIL across varying levels of circulating disease. For clarity, we categorized our test data into different bins: high (>=20%), low (<20%), very low (<=10%), extremely low (<=5%), and invisible leukemia (<=1%). CAREMIL demonstrates high performance at low levels of blasts on real patients compared to synthetic patients, which are represented by the dashed line. Surprisingly, CAREMIL achieves 75% sensitivity, even in cases of with 0% blast count. This suggests that CAREMIL may be recognizing changes in the composition or morphology o

C](#F4) shows that the high attention category contains overrepresentation of monocytes and blasts. Pathologist review revealed that the cells interpreted as blasts and monocytes by DeepHeme in these samples were hairy cells. DeepHeme is trained on patients with no known neoplastic disease. Thus, hairy cells were absent from the training corpora. Despite this limitation, the CAREMIL model determined the features associated with hairy cells was important to the underlying diagnosis despite the encoder model (DeepHeme) not being trained to specifically classify hairy cells. This highlights the s

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDigital Imaging for Blood Diseases · AI in cancer detection · Cell Image Analysis Techniques

Full text

Introduction

1

Accurate diagnosis of hematologic malignancies from peripheral blood smears (PBSs) requires integrating both the morphology of individual white blood cells and the overall cellular composition (1; 2; 3; 4). Peripheral smears serve as a non-invasive, rapid “liquid biopsy”, offering critical information about underlying bone marrow pathology. However, diagnosis often hinges on detecting subtle morphological abnormalities or shifts in cellular proportions, making interpretation highly dependent on expert review(5). These challenges are particularly acute in cases of low circulating disease burden or early-stage dysplasia, where abnormalities may escape detection through routine screening (6; 7). The clinical and computational workflow for PBS analysis is outlined in Figure 1A. Recent computational efforts have focused primarily on automating aspects of PBS interpretation, including single-cell classification and cytometric quantification (8), (9), (10). While these approaches improve efficiency, they fail to capture the integrative reasoning required for whole-slide diagnosis. In contrast to solid tumor pathology, where tiling approaches can leverage preserved tissue architecture (11), (12), hematologic specimens lack cohesive spatial organization(13), rendering traditional whole slide image (WSI) analysis methods poorly suited for blood smears, where proportions of cell types and single-cell morphology are central to diagnostic interpretation, rather than tissue architecture(14; 15). As a result, new frameworks are needed to model PBSs at the single-cell level while enabling slide-level diagnostic predictions. Multiple instance learning (MIL) offers a promising solution for weakly supervised learning in pathology (16; 17; 18; 19; 20; 21), treating individual cells as instances and the PBS as a labeled bag. Attention-based MIL architectures have shown success in solid tumor WSIs by improving both performance and interpretability, but their adaptation to hematologic specimens remains limited. Prior work has applied MIL to blood smear images using general-purpose encoders without attention-based aggregation, but comprehensive strategies for integrating morphology and composition with domain-specific features have not been explored (22).

Here, we present CAREMIL (Cell AggRegation, Explainable, Multiple Instance Learning), an attention-based multiple instance learning framework optimized for hematologic liquid biopsy interpretation. CAREMIL integrates both cell morphology and cytometric composition at the instance level to generate accurate, interpretable slide-level predictions. We evaluate CAREMIL across three diagnostically distinct hematologic malignancies: acute leukemia, characterized by increased circulating blasts; hairy cell leukemia, marked by morphologically abnormal lymphocytes; and myelodysplastic syndromes, distinguished by subtle dysplastic (morphologic) changes across multiple hematologic lineages. To support cell-level representation learning, we leverage DeepHeme (14), a domain-specific encoder trained on hematologic cell classification, applying its feature embeddings for the first time to weakly supervised, slide-level diagnosis.

CAREMIL consistently outperforms gated MIL architectures across multiple encoder types, with the greatest performance gains observed when using in-domain foundational models trained on non-hematologic images. The model accurately identifies diagnostically relevant cell populations without requiring explicit cell-level supervision and remains robust even when upstream classifiers mislabel abnormal morphologic variants. These findings demonstrate the utility of attention-based weakly supervised learning for morphology-informed, quantitative diagnostics in hematology and highlight the potential for broader application across liquid biopsy specimens.

Results

2

Dataset Curation

2.1

We curated a dataset of 1,017 whole slide images (WSIs) of peripheral blood smears (PBSs) scanned at 400x magnification (40x objective), representing a diverse cohort of 573 patients evaluated by the hematopathology service. Each PBS was paired with a bone marrow biopsy and clinical pathology report, with diagnoses established by board-certified hematopathologists based on the full clinical context in the bone marrow report. The dataset includes 307 cases of acute leukemia (AML), 95 cases of myelodysplastic syndromes (MDS), 41 cases of hairy cell leukemia (HCL), and 130 normal (NL) controls. Notably, control cases represent patients with normal overall reports, rather than healthy individuals, reflecting the real-world diagnostic setting. Each WSI contains hundreds of white blood cells, spanning a wide range of disease burden and morphologic representations, including blast-negative AL and subtle dysplasia in MDS. To ensure independence across datasets, train, validation, and test splits were performed at the patient level.

Improving the Attention Mechanism for Slide-Level Predictions

2.2

Multiple instance learning (MIL) provides a weakly supervised framework for training models when only bag-level (i.e. slide-level) labels are available, a setting well-suited to pathology applications where instance-level annotations (e.g., per-cell labels) are often impractical. In MIL, each bag consists of a collection of instances, and the model must learn to aggregate information across instances to make a bag-level prediction. We compare our CAREMIL model with the Gated attention MIL.

Gated attention MIL is common approach used in histology WSIs where attention is is applied by using a gating mechanism (17) [Ilse et al, 2018]. Instance features undergo two separate nonlinear transformations, one with a tanh activation and one with a σ activation. The outputs of these transformations are combined via element-wise multiplication before attention scoring. This gating mechanism allows the model to better modulate instance contributions by enhancing or suppressing different aspects of the feature representation prior to computing attention. However, aggregation remains a weighted sum across instances, and the model primarily focuses on selecting a few highly relevant instances for diagnosis.

CAREMIL builds upon the MIL paradigm by explicitly modeling both the individual morphology of instances and broader distributional properties of the instance population. Instead of aggregating raw cell embeddings through a weighted sum, CAREMIL first computes global statistical summaries (e.g. mean and variance) across the set of instance embeddings. These summary features are combined with the original cell embeddings and jointly processed through a transformer-style self-attention mechanism. This architecture enables CAREMIL to reason both about rare high-attention instances and about subtle global shifts in cell population characteristics, improving robustness and clinical interpretability.

Model evaluation and performance

2.3

CAREMIL’s performance was compared to two different approaches. The first is a set of optimized machine learning classifiers operating on the cell count alone (cytometry). To set a strong baseline, we performed hyperparameter optimization using a grid search over 6 different standard models and 227 different hyperparameter groups. Performance was compared with gated MIL attention (17). We also tested the performance of both the CAREMIL and the Gated MIL architecture on four different image encoders (see Figure 2. The first is DeepHeme, a domain-specific RexNeXt-50-based classifier, trained to identify 23 different cell hematopoietic cell classes found in the normal bone marrow. To create strong baselines for our work we also make comparisons on UNI2-h (23) and Virchow2, self-supervised computational pathology foundation models (24). Lastly, we also use a ResNeXt-50 model trained on ImageNet, to understand the performance of general purpose computer vision models not trained on histopathology. These models and their training hyperparameters are summarized in Supplementary Table 2.

Each approach was evaluated in four distinct clinical contexts, as summarized in Supplementary Table 3. The first task involved identifying acute leukemia (AML) from normal (NL) samples. AL encompasses both acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) and is defined by an increased percentage of blast cells in the peripheral blood or bone marrow.

The second task focused on distinguishing myelodysplastic syndromes (MDS) from normal samples. MDS is characterized by abnormal morphology (dysplasia) in hematopoietic cells. While MDS is typically diagnosed in patients presenting with cytopenias, the ability to detect myelodysplasia morphologically in patients with cytometrically normal blood counts may provide a low-cost method for identifying individuals at risk of developing MDS, or who already have cytogenetically defined MDS. Earlier identification of these patients could enable interventions aimed at preventing disease progression.

The third task aimed to separate hairy cell leukemia (HCL) from normal samples. HCL is defined by the presence of morphologically abnormal lymphocytes with “hairy” cytoplasmic projections, although these projections were rarely observed in this dataset due to slide preparation artifacts. HCL does not necessarily cause abnormal cytometry.

The fourth task involves distinguishing among AL, MDS, HCL, and NL, presenting a significantly more challenging quaternary classification problem compared to binary classification. The increased complexity stems from the need to separate four distinct classes with overlapping morphologic and cytometric features. For instance, AL and MDS may share dysplastic characteristics and overlapping blast counts, while all four classes can exhibit normal cytometric profiles under certain conditions. This task is further complicated by the requirement for the classifier to generalize across all classes while minimizing errors specific to each, highlighting the difficulty of achieving robust and accurate multi-class classification.

Slides were labeled with diagnoses derived from pathology reports at the case level. Model performance was evaluated based on its ability to correctly predict these case-level diagnoses. For example, a diagnosis of acute leukemia (AML) did not require detectable disease in the peripheral blood, as the diagnosis may have been based on findings from the bone marrow. This approach ensured the inclusion of cases with low-level circulating disease that might be identified by the AI system, even when undetectable by automated hematology analyzers or human pathologists.

CAREMIL outperforms Gated MIL on all tasks (see Supplementary Table 5) and performs better irrespective of the image encoder being used (see Figure 2A). Our results also show that the hematology-specific image encoder, DeepHeme, demonstrated substantially better performance compared to large general purpose pathology foundation models like UNI2-h or Virchow2. Both the encoder and the aggregator model architecture play an important role in the overall performance. The CAREMIL model architecture with the DeepHeme image encoder demonstrates superior performance compared to other models and encoders for the hematology domain.

Supplementary Table 4 and 5 summarizes the average AUROC-scores for each model and experiment. In the MDS vs NL and multiclass task, UNI2-h achieved slightly higher performance, though the difference was not significant. Across all four experiments, models using the CAREMIL architecture performed significantly better than their Gated MIL equivalent. Additionally, cytometry-based (e.g. based on cell counts alone) machine learning classifiers consistently outperformed MIL models that relied on ImageNet encoders.

Explainability and patient safety

2.4

AI models deployed in medical environments must include a layer of explainability to ensure patient safety and clinical utility. A key aspect of explainability is enabling physicians to rapidly determine whether the model’s performance is based on sound biology and not spurious correlations. Additionally, adding explainability outputs can facilitate broader analyses on how abnormal cells influence diagnostic decisions. For this reason, we developed a workflow designed to provide pathologists with interpretable outputs for each sample. The output consists of a figure showcasing three key cell subsets (see Supplementary Figure 7) : the cells with the highest attention values (top row), a random selection of cells to represent the whole-slide context (middle row), and the cells with the lowest attention values (bottom row). This structure allows pathologists to confirm whether the model is focusing on diagnostically relevant cells while avoiding irrelevant or morphologically normal cells.

The top row specifically highlights the morphology of disease cells, offering a clear visual confirmation of abnormal cells. In cases of very low disease burden, where abnormal cells may not appear in the random selection, the top row ensures the clinician can still validate the diagnosis. The middle row provides a representative sampling of randomly selected cells, enabling clinicians to quickly estimate the disease burden in the patient by assessing the proportion of abnormal cells. The bottom row allows clinicians to verify that the model correctly identifies and de-emphasizes cell types that are least diagnostically relevant, ensuring it focuses on cells that contribute meaningfully to the predicted diagnosis. While the figures presented in this study are limited to 10 cells per row for visual clarity, a clinical deployment would include a larger number of cells. This would make it possible to accurately estimate disease burden in cases with a low percentage of abnormal cells.

In the context of model explainability, specific patterns of cell morphology were observed for each disease state, corresponding to the high attention cells. Figure 2C is composed of disease specific examples of high attention-based outputs that demonstrate the model’s capacity to align its predictions with clinically relevant morphological features while providing a broader context for disease burden and cell-type distribution.

In Figure 2C the top panel shows that in a sample with acute myeloid leukemia (AML), the high-attention cells consistently included leukemic blasts characterized by fine chromatin and high nuclear-to-cytoplasmic ratios. The next panel shows a case of MDS, with high-attention cells including examples of both dysplastic neutrophils and immature granulocytes which, when circulating in peripheral blood, are characteristic of MDS. The bottom panel shows an HCL sample, where the high-attention cells highlight the characteristic abnormal “hairy cell” lymphocytes associated with the disease.

Supplementary Figure 7 highlights an additional case of MDS. High-attention cells in this case include dysplastic hypogranular neutrophils, monolobated neutrophils, dysplastic erythroblasts exhibiting nuclear-cytoplasmic dyssynchrony, and abnormal immature granulocytes circulating in the peripheral blood. The inclusion of two examples for MDS demonstrates the model’s ability to identify different types of dysplasia across specimens, reflecting the diversity of morphological abnormalities seen in MDS, which represents a family of syndromes, rather than one specific entity. These findings underscore the model’s adaptability in detecting patient-specific manifestations of dysplasia, providing critical information for diagnosis and model explainability.

Limits of detection in synthetic patients

2.5

To evaluate the sensitivity of CAREMIL across varying blast percentages, we generated a dataset of synthetic patients by combining fixed proportions of blasts from leukemia patients with randomly sampled non-blast cells from normal individuals. Leukemia is typically diagnosed when blast percentages exceed 20%. However, lower blast levels may be observed in post-treatment patients, cases where the disease is predominantly confined to the marrow, or for certain disease genotypes. As illustrated in Figure 3A, the probability of correctly diagnosing acute leukemia increases with rising blast percentages. Using a cutoff of 0.5, at 0% blast proportion, the algorithm identifies all but one sample as normal and at 20% the algorithm identifies all samples as acute leukemia.

False Positive Review Reveals True Positive Relapse

2.6

Interestingly, CAREMIL’s performance at low levels of circulating blasts in synthetic datasets appeared substantially lower than in the WSI dataset. The blast percentages in our WSI dataset spans from 0% to 100% in cases of acute leukemia. Slides were included in the study based on pathology reports diagnosing acute leukemia, even if circulating blasts were not detected by hematology analyzers or visual inspection noted in the pathology report. In cases where circulating blast percentages were below 20%, patients typically exhibited higher blast counts in the bone marrow, consistent with the diagnostic criteria for acute leukemia.

In Figure 3B, the continuous line delineates the performance of CAREMIL across varying levels of circulating disease. For clarity, we categorized our test data into different bins: high (>=20%), low (<20%), very low (<=10%), extremely low (<=5%), and invisible leukemia (<=1%). CAREMIL demonstrates high performance at low levels of blasts on real patients compared to synthetic patients, which are represented by the dashed line. Surprisingly, CAREMIL achieves 75% sensitivity, even in cases of with 0% blast count. This suggests that CAREMIL may be recognizing changes in the composition or morphology of the non-blast cell population. However, these changes are not necessarily specific to acute leukemia and could also occur in other hematological disorders. For instance, they may result from marrow damage caused by leukemia cells, leading to alterations in peripheral blood morphology or cytometry. Alternatively, acute leukemia might induce specific but subtle, multifactorial changes in non-blast cells that are not easily recognizable by pathologists, but identifiable by CAREMIL.

Although missed diagnoses in cases with low-level diseases are anticipated, encountering false positives where normal slides are misclassified as acute leukemia is less expected. In our test set, no such instances occurred. However, in the validation set, four cases were identified. Upon review by our pathologist colleagues, two were found in patients with a history of leukemia but no current detectable disease. Another involved a patient who had recently received granulocyte colony stimulating factor (G-CSF) treatment prior to a bone marrow transplant for lymphoma. For patients whose bone marrow stem cells are being harvested for transplant, G-CSF treatment is performed to massively increase circulating hematopoietic stem cells (blasts) prior to harvesting.

The most notable “false positive” involved a patient with a history of acute myeloid leukemia (AML) in remission who presented to the clinic with an enlarged lymph node. Although imaging raised concerns for relapse, initial diagnostic evaluations, including a peripheral blood smear, bone marrow biopsy, bone marrow flow cytometry, and bone marrow biopsy genetic testing for a known FLT3 mutation did not confirm disease recurrence. Based on this workup, the bone marrow biopsy report was classified as morphologically normal, resulting in its inclusion in our test set as a morphologically normal sample. However, a lymph node biopsy confirmed AML, prompting a repeat bone marrow biopsy three weeks later, which also revealed AML with an 80% blast count. This suggests that CAREMIL may have identified the presence of AML approximately three weeks earlier than the standard diagnostic workup.

Model-Derived Morphometric Signatures Reflect Underlying Disease Biology

2.7

The attention values created from the model provide valuable interpreteability beyond the individual patient level. For each individual patient we create an attention panel as shown in Figure 3, through which a pathologist can verify the cells being looked into for each prediction. To evaluate the overall effectiveness of our interpretability mechanism beyond individual attention maps, we compare the cell type distribution within the top 10% most-attended cells in each disease category to the distribution of all cells in the same disease category. This highlights the cell types that are relevant to the decision making for each disease category.

As shown in 4A the relative composition of blast cells in top 10% highest attention cells is 2.5 times higher than in the total population, which indicates that the attention mechanism is correctly identifying blasts as the important cell class for diagnosing AML. Supplemental Figure 4C extends this analysis using a log-odds ratio to highlight the contribution of rare cell classes to the disease signature. Notably, we observe increases in mitotic bodies, orthochromic/polychromatophilic erythroblasts, plasma cells, and basophils. While mitotic bodies are rarely seen in circulation, their presence is elevated in acute leukemia, and the model appears to have captured this pattern. The increases in other rare circulating cell types likely reflect underlying marrow stress, further supporting the model’s ability to detect subtle morphological indicators of disease.

In Figure 4B, the MDS disease signature is illustrated. For MDS patients, the model assigns high attention to lymphocytes, blasts, and basophils. We hypothesize that the increased attention to lymphocytes is due to the relative lymphocytosis observed in patients with myeloid cytopenias, a hallmark feature of MDS. On further examination some patients in the cohort had lymphocyte percentages in excess of 50% of all circulating white blood cells. Likewise, abnormal blasts and basophilia are also commonly seen in MDS.

The comparative attention histogram for HCL shows attention less focused on lymphocytes. Thus, the DeepHeme model does not classify the hairy cells as normal lymphocytes. Figure 4C shows that the high attention category contains overrepresentation of monocytes and blasts. Pathologist review revealed that the cells interpreted as blasts and monocytes by DeepHeme in these samples were hairy cells. DeepHeme is trained on patients with no known neoplastic disease. Thus, hairy cells were absent from the training corpora. Despite this limitation, the CAREMIL model determined the features associated with hairy cells was important to the underlying diagnosis despite the encoder model (DeepHeme) not being trained to specifically classify hairy cells. This highlights the strength of the CAREMIL model, which does not rely on the cell-level classification provided by the encoder. Even when the cell labels are incorrect, the model can learn to identify the important cell morphologies provided by the feature embedding from the encoder.

Model attention aligns with diagnostic cell identity

2.8

The disease signature helps pathologists review the cells the model takes into consideration for diagnosis along with the weight given to each cell. However, this pictorial and manual verification is not enough to prove the model’s capabilities. We want to understand the overlap between the cells that are given a high attention value (top 25% by attention value) by the model vs the diagnostically relevant cells identified for that patient. We choose a sample that has been diagnosed with AML for this analysis, as AML has a simple biomarker of the presence of blast cells. In Fig 5 almost all high attention cells are blast cells showcasing the practical utility of the model to identify important diagnostic markers without the need to explicitly teach the model specific biomarkers for every disease.

Discussion

3

Overall we proposed a novel framework CAREMIL with DeepHeme which is the best performative model for hematological diagnostic tasks. Instead of traditional diagnosis of hematologic cancers which rely on enumerating different blood cell types (the blood differential) and applying heuristic rules alongside morphological analysis of individual cells, our pipeline leverages a hematologic domain-specific encoder to integrate cellular morphology with cytometric features for case-level diagnosis of hematologic malignancies from blood smears.

Unlike prior AI efforts that classify cells individually and base diagnosis on cell differentials alone, our model also integrates cell-level morphologies across the entire case to inform diagnostic predictions. Moreover, conventional cell classifiers often offer limited interpretability, making it difficult to meaningfully involve pathologists in the diagnostic process (25), a limitation CAREMIL explicitly addresses by providing cell-level attention scores.

In computational pathology for tissue sections, whole slide images (WSIs) are analyzed through a tiling approach for slide-level diagnosis. However, these methods are unsuitable for hematology, where diagnostically relevant cells, such as white blood cells (WBCs), are sparsely and randomly distributed. Our proposed methods overcomes these limitations by implementing a cell-aggregation-based multiple instance learning (MIL) method with attention mechanisms that prioritize diagnostically important cells (26) for pathologist review, facilitating accurate slide-level predictions.

Our findings demonstrate that CAREMIL paired with DeepHeme cell image encoding achieves substantial improvements in diagnostic accuracy by combining individual cell-level morphological features with cytometric composition, which is pivotal for diagnosing and staging hematologic malignancies. Notably, our pipeline achieved AUROC scores of 0.999, 0.891 and 0.945 for AML, MDS, and HCL, respectively, underscoring its clinical relevance across diverse disease contexts. We showed that use of DeepHeme outperforms other image encoders, including pathology foundation models and ImageNet-trained encoders. This result highlights the importance of domain-specific feature extraction for arriving at the most accurate diagnosis.

An important aspect of the MIL strategy for hematologic diagnosis is the explainability framework we highlight. By employing an attention-based MIL aggregator, the model identifies cells critical to the diagnostic prediction, allowing pathologists to validate and interpret the model’s rationale in the context of each case. This explainability, alongside performance, is essential for clinical adoption, as it enables medical professionals to ensure model predictions align with established diagnostic criteria, thereby promoting clinician trust (27) and patient safety.

The attention-based weakly supervised approach employed in CAREMIL offers unique insights into underlying disease processes by identifying diagnostically relevant cells, rather than relying solely on quantitative metrics like blast counts. Specific genotypes of AML associated with aggressive phenotypes do not require a specific blast percentage for diagnosis, according to the 5th edition of the World Health Organization Classification of Haematolymphoid Tumours(28). This pipeline can be replicated with different disease labels to build an automated biomarker identification system that will emerge as a valuable tool in the future to stratify high-risk patients, enabling earlier and more precise interventions. This approach can also be used to develop outcome prediction models that are morphology informed, using outcome labels rather than diagnosis labels.

The ability to assign attention values to individual cells based on their relevance to the diagnostic outcome allows for a nuanced interpretation of WSI data that captures both explicit and latent features across the entire sample. This model of disease characterization holds promise for future applications, where attention-based diagnostics may surpass traditional metrics in robustness and specificity, especially for diseases with complex presentations. By focusing on pathologically significant cells rather than fixed thresholds, CAREMIL could enable more personalized and accurate diagnostics that reflect the continuum of disease rather than categorical criteria.

While our techniques demonstrate strong diagnostic potential, several limitations underscore areas for future development. The current approach does not support end-to-end training or fine-tuning, which may be necessary for applications lacking pre-existing in-domain feature encoders (non-hematopoietic cells in cytology applications, for example). Additionally, DeepHeme generates feature embeddings from a relatively modest-sized ResNeXt-50 model, and future implementations may benefit from larger, self-supervised architectures to achieve even better performance, especially as diagnostic complexity increases. (29)

Another consideration is the diagnostic spectrum of the cell-based MIL strategy. This study addresses a four-way classification, yet the full range of hematologic diagnoses is far more extensive. Expanding the study to encompass additional conditions will require large, diverse datasets and further testing to ensure reliability across a broader range of hematologic pathologies. Moreover, while this study focuses solely on peripheral blood smears—a practical choice for screening due to their low cost and minimally invasive nature—a more accurate diagnosis would be achieved by integrating data from bone marrow aspirates, core biopsies, immunohistochemistry, flow cytometry, clinical text histories, and genomic data. Conveniently, the current pipeline provides an encoding strategy that could allow for incorporation in a multimodal study that includes these diverse data types.

The MIL architecture, CAREMIL along with the DeepHeme cell encoder, represents a significant advancement in computational hematology, providing an accurate, interpretable framework for diagnosing hematologic diseases using WSIs of peripheral blood smears. Leveraging a cell-aggregation-based multiple instance learning approach not only preserves essential morphologic information of white blood cells but also achieves superior diagnostic accuracy compared to cytometry-based models. Future work should explore expanding this pipeline’s diagnostic scope and integrating multimodal data to enhance its clinical utility in hematology.

Methods

1

Dataset Collection

1.1

We curated a dataset of 1,017 WSIs of blood smears scanned at 400× optical resolution (i.e. using a 40x objective) using a Hamamatsu S360 WSI scanner. These include 307 acute leakemia, 95 myelodysplastic syndromes, 41 hairy cell leukemias, and 130 morphologically normal control cases (see Supplementary Figure 8A) Scanned images range from 2GB to 20GB and were scanned using a single z-plane with uniform refocusing points across each slide. Conventional computer vision methods that rely on manually defined image features for classification often have difficulty with WSIs, as they include complex biological structures (cell nuclei, cytoplasms, granules, etc.) with high variability in appearance across multiple patients and disease states (such as cancer grades or stages). To reduce the computation required for processing these gigapixel WSIs, we first extract regions of interest containing WBCs at 50× optical resolution (downsampled from the highest 400× resolution by a factor of 8) using a combination of deep learning and classical computer vision metrics. From these regions, we extract cell images of size 96 × 96 at 400× magnification, cropped around the cell centers detected by a YOLOv8 model^1^ (Figure 1A)). We examined a cohort of 573 patients (Supplementary Table 8A). For each WSI, we extract ~ 2000 WBC images, and subsequently extract 1000-dimensional feature embeddings for each of them using the last-layer features of a ResNeXt-50 model trained specifically on WBC classification task^2^.

Incorporating cell morphology into the predictions

1.2

Current computational approaches for the diagnosis of hematologic neoplasms involve predicting the blood differential through summation of individual cell classifications. Trained ML classifiers use the cytometric vector as input to predict diagnosis. This approach loses morphological information of individual cells, which pathologists use when making diagnostic decisions. Moreover, it lacks cell-level interpretability, without which the final predictions are difficult for pathologists to interpret and audit, making it difficult to use in a clinical setting.

To incorporate individual cell morphology, we propose a trained encoder to first generate vector-embÏeddings of all the extracted cell images (Figure 1B). Our MIL model then aggregates these cell embeddings and makes a slide-level prediction. Explainability is built into the MIL aggregator to facilitate pathologists’ interpretation of the model’s predictions.

CAREMIL model architecture

1.3

The CAREMIL architecture^1^ is built to ensure that we can take a varying number of cells as input and get relevant attention values for each cell embedding, while making a diagnosis prediction (see Figure 1C. Once all the cell images are extracted, they are passed to a frozen image encoder, that creates embedding values that represent the morphological information in each cell image.

The cell embeddings are passed to different trainable networks, namely the aggregator network and the parallel network, both will reduce each cell’s embedding size to the same dimension. The embeddings outputs from the aggregator network are then passed to aggregator functions. We explore several aggregation functions (Supplementary Figure 8C). For example, Kraus et al.^3^ showed that generalized mean can lead to better predictions in some computational pathology applications, whereas, Carmichael et al.^4^ showed that variance-based aggregation function can lead to better predictions by capturing intratumoral heterogeneity. These function outputs represent compositional information of the bloodstream, similar to the cytometric vector produced in the cytometric based ML classifiers. The vectors are concatenated with the outputs of the parallel network before being fed into the self-attention layer. Here the aggregation function outputs act as class tokens^5^, similar to BERT^6^ and Vision Transformers^7^. The class tokens from the attention layer are then fed into a multilayer perceptron (MLP) classifier to predict logits and train the network using a cross-entropy loss.

Case-Level Attention Visualizations

1.4

To validate CAREMIL’s interpretability in diagnosing hematologic malignancies, attention panel images were generated for cases of acute myeloid leukemia (AML), hairy cell leukemia (HCL), and myelodysplastic syndromes (MDS). For each case, we selected the top 10% of cells with the highest attention scores, a random sample of cells, and the lowest attention-scored cells. These panels allowed visual verification of the model’s focus, with high-attention cells expected to show disease-relevant morphologies (e.g., blasts in AML, dysplastic myeloid cells in MDS, and abnormal lymphocytes in HCL). Randomly selected cells provide a sampling of the sample’s overall composition, while low-attention cells should include morphologically irrelevant or non-diagnostic cells. This approach enables assessment of CAREMIL’s attention mechanism in identifying cells most relevant to the diagnostic outcome.

Limit of Detection Experiment

1.5

To assess CAREMIL’s sensitivity for detecting acute leukemia (AL) in cases with low blast counts, synthetic patient cases were generated by combining cells from normal patients with varying proportions of blast cells. For each synthetic case, cells from normal peripheral blood smear images were incrementally mixed with an increasing percentage of blast cells (ranging from 0% to 20%), simulating conditions of low to moderate blast presence. CAREMIL was then run on each synthetic patient case to evaluate the model’s probability of predicting AL at different experimental blast levels (ref Figure 3). Real patient cases with similar blast count distributions were also evaluated for comparison, allowing analysis of CAREMIL’s performance across synthetic and real datasets.

Attention Mechanism Evaluation

1.6

To evaluate how CAREMIL’s attention mechanism selectively prioritizes specific cell types in different hematologic malignancies, we compared the distribution of high-attention cells across acute leukemia (AL), myelodysplastic syndrome (MDS), and hairy cell leukemia (HCL) cases. For each disease type, we identified the cell types based on DeepHeme cell classification top 10% highest attention cells as outputted by the MIL model. This allows for calculation of the mean differential distribution across various cell types with the highest attention. These distributions were then compared to the overall cell population in normal samples to identify disease-specific enrichment patterns. This analysis provides insight into the model’s diagnostic process, enabling validation of the attention mechanism’s interpretability and effectiveness in distinguishing distinct hematologic conditions.

Supplementary Material

Supplementary Files

This is a list of supplementary files associated with this preprint. Click to download.

Bibliography36

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Kimura K. Automated diagnostic support system with deep learning algorithms for distinction of Philadelphia chromosome-negative myeloproliferative neoplasms using peripheral blood specimen. Sci. Reports 11, 3367 (2021). DOI 10.1038/s 41598-021-82826-9. · doi ↗
2Foucar K. Concordance among hematopathologists in classifying blasts plus promonocytes: a bone marrow pathology group study. Int. J. Lab. Hematol. 42, 418–422 (2020).32297416 10.1111/ijlh.13212 · doi ↗ · pubmed ↗
3Döhner H., Weisdorf D. J. & Bloomfield C. D. Acute myeloid leukemia. New Engl. J. Medicine 373, 1136–1152 (2015).
4Pelcovits A. & Niroula R. Acute myeloid leukemia: a review. Rhode Island medical journal 103, 38–40 (2020).
5Wintrobe M. M. Clinical hematology. Acad. Medicine 37, 78 (1962).
6Chase M. L. Consensus recommendations on peripheral blood smear review: defining curricular standards and fellow competency. Blood Adv. 7, 3244–3252 (2023). DOI 10.1182/bloodadvances.2023009843.36930800 PMC 10336252 · doi ↗ · pubmed ↗
7Campo E. The international consensus classification of mature lymphoid neoplasms: a report from the clinical advisory committee. Blood, The J. Am. Soc. Hematol. 140, 1229–1253 (2022).
8Sidhom J.-W. Deep Learning for Distinguishing Morphological Features of Acute Promyelocytic Leukemia. Blood 136, 10–12 (2020). URL https://doi.org/10.1182/blood-2020-135836. DOI 10.1182/blood-2020-135836. · doi ↗