Artificial intelligence-based classification of Spitz tumors
Ruben T. Lucassen, Marjanna Romers, Chiel F. Ebbelaar, Aia N. Najem, Donal P. Hayes, Antien L. Mooyaart, Sara Roshani, Liliane C.D. Wynaendts, Nikolas Stathonikos, Gerben E. Breimer, Anne M.L. Jansen, Mitko Veta, Willeke A.M. Blokx

TL;DR
This study explores how artificial intelligence can help distinguish Spitz tumors from melanomas and predict their genetic features, showing promising results compared to human experts.
Contribution
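The study introduces AI models that reached higher accuracy than four experienced pathologists in a reader study on classifying Spitz tumors and predicting their genetic aberrations, although most individual differences were not statistically significant.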
Findings
AI models achieved an AUROC of 0.95 and accuracy of 0.86 in distinguishing Spitz tumors from melanomas.
AI predicted genetic aberrations with 0.55 accuracy, significantly better than random guessing.
AI-based recommendations could reduce pathology workflow costs and examination times.
Abstract
Spitz tumors are diagnostically challenging due to overlap in atypical histological features with conventional melanomas. We investigated to what extent artificial intelligence (AI) models, using histological and/or clinical features, can: (1) distinguish Spitz tumors from conventional melanomas; (2) predict the underlying genetic aberration of Spitz tumors; and (3) predict the diagnostic category of Spitz tumors. The AI models were developed and validated using a retrospective cohort from the University Medical Center Utrecht, the Netherlands. The dataset consisted of 393 Spitz tumors and 379 conventional melanomas. Predictive performance was measured using the area under the receiver operating characteristic curve (AUROC) and the accuracy. The performance of the AI models was compared with that of four experienced pathologists in a reader study. Moreover, a simulation experiment was…
Introduction
Cutaneous melanocytic lesions are categorized into many subtypes, each with distinct biological behavior [1]. One of these subtypes, known as Spitz tumors, mostly develops at a young age and is histologically characterized by the presence of large epithelioid and/or spindled melanocytes with variable cytonuclear atypia [2]. Similar atypia is also frequently seen in conventional melanomas, making it challenging at times to differentiate the two based on histopathological assessment alone, as evidenced by only a moderate inter-observer agreement between expert dermatopathologists [3]. Whereas conventional melanomas are by definition malignant, the majority of Spitz tumors display benign biological behavior. For this reason, there is a high risk of both under- and overtreatment in case of misdiagnosis [2]. Immunohistochemical (IHC) staining and molecular analyses can often alleviate the diagnostic challenge by identifying a defining genetic aberration (i.e., a BRAF or NRAS mutation in conventional melanomas and an HRAS mutation or kinase fusion in Spitz tumors) [4], but are more expensive and time-consuming to perform.
Recent advances in artificial intelligence (AI) show promising results for a range of diagnostic and prognostic applications in pathology [5,6]. Several studies have explored the use of AI models for classification of Spitz tumors using learned or human-interpreted features from whole-slide images (WSIs), but were mainly limited by small datasets and a lack of genetic confirmation of the defining driver aberration for all lesions included [7–9]. In this study, we investigate the accuracy with which an AI model, using histological and/or clinical features, can perform three prediction tasks: (1) distinguishing Spitz tumors from conventional melanomas; (2) predicting the underlying genetic aberration of Spitz tumors (i.e., a fusion in ALK, ROS1, NTRK, or all other Spitz-related aberrations); and (3) predicting the diagnostic category of Spitz tumors (i.e., benign, intermediate, or malignant). We conduct a reader study to compare the performance of the AI models with that of four experienced pathologists. Moreover, to study how implementing AI-based recommendations for ancillary diagnostic testing could affect the workflow of the pathology department, we conduct a simulation experiment. Whereas perfect predictive performance for all of these tasks is unlikely based on histological and clinical features alone, even an AI model with reasonable performance can potentially be valuable, for example, as a decision-support tool for guiding pathologists in the selection of ancillary diagnostic tests to reach the correct diagnosis more efficiently.
Methods
Study design
This retrospective cohort study was performed using archival data from the pathology department of the University Medical Center (UMC) Utrecht, the Netherlands. All genetically confirmed conventional melanomas and Spitz tumors accessioned between January 1, 2013 and August 31, 2023, were included. The study does not fall within the scope of the Dutch Medical Research Involving Human Subjects Act (WMO) and therefore does not require approval from an accredited medical ethics committee in the Netherlands. Nevertheless, an independent quality assessment (25U-0162) was conducted at the UMC Utrecht to ensure compliance with relevant laws and regulations, including those related to the informed consent procedure, data management, privacy, and legal considerations. All data were pseudonymized. Data from patients who opted out of the use of their data for research purposes were excluded.
Dataset curation
A total of 772 primary cutaneous melanocytic lesions were included in the dataset, comprising 379 conventional melanomas and 393 Spitz tumors (including nevi, melanocytomas, and melanomas). The pathway of the lesions (i.e., Spitz or conventional melanoma) was confirmed using IHC staining, fluorescence in situ hybridization, next-generation sequencing, and/or targeted RNA sequencing. Lesions without a confirmed pathway were excluded. The diagnostic category of the lesions was determined by pathologists' assessment as gold standard, based on the histological features, IHC stains (e.g., PRAME and p16 expression), and genetic analysis (e.g., the number of segmental copy number variations determined using single-nucleotide polymorphism (SNP) array analysis, absence or presence of secondary pathogenic mutations in for instance the TERT promoter or in the TP53 or CDKN2A gene). The majority of the included lesions (80.4%) concerned referral cases for consultation. Hence, for most lesions, there were WSIs available of slides prepared at the referring center and internal slides prepared at the pathology department of the UMC Utrecht, with a different hematoxylin and eosin (H&E) appearance due to variation in preparation and staining protocols. Characteristics of the lesions in the dataset are summarized in Table 1. A pre-existing nevus was observed in 21.1% of the conventional melanomas.

Table 1. Patient and lesion characteristics for Spitz tumors and conventional melanomas.

| Characteristics | Spitz tumors (N = 393) | Conventional melanomas (N = 379) |
|---|---|---|
| Age | | |
| Median (IQR) | 27 (16) | 48 (28) |
| Min-Max | 1–73 | 3–85 |
| Sex (%) | | |
| Male | 118 (30.0) | 158 (41.7) |
| Female | 275 (70.0) | 221 (58.3) |
| Location (%) | | |
| Head and neck | 32 (8.1) | 56 (14.8) |
| Trunk | 73 (18.6) | 154 (40.6) |
| Upper extremities | 66 (16.8) | 55 (14.5) |
| Lower extremities | 197 (50.1) | 94 (24.8) |
| Hands and feet | 23 (5.9) | 13 (3.4) |
| Unknown | 2 (0.5) | 7 (1.8) |
| Diagnostic category (%) | | |
| Benign | 209 (53.2) | – |
| Benign/intermediate | 37 (9.4) | – |
| Intermediate | 95 (24.2) | – |
| Intermediate/malignant | 17 (4.3) | – |
| Malignant | 35 (8.9) | 379 (100.0) |
| Genetic aberration (%): mutations | | |
| BRAF | – | 263 (69.4) |
| NRAS | – | 112 (29.6) |
| BRAF & NRAS | – | 4 (1.1) |
| HRAS | 34 (8.7) | – |
| ROS1 | 1 (0.3) | – |
| Genetic aberration (%): fusions | | |
| ROS1 | 106 (27.0) | – |
| NTRK (total) | 111 (28.2) | – |
| NTRK1 | 17 (4.3) | – |
| NTRK2 | 31 (7.9) | – |
| NTRK3 | 27 (6.9) | – |
| NTRK, unknown | 36 (9.2) | – |
| ALK | 59 (15.0) | – |
| MAP3K8 | 41 (10.4) | – |
| BRAF | 18 (4.6) | – |
| RET | 18 (4.6) | – |
| MET | 4 (1.0) | – |
| RASGFR1 | 1 (0.3) | – |
| WSI availability (%) | | |
| Internal and consultation | 264 (67.2) | 220 (58.0) |
| Internal only | 102 (26.0) | 117 (30.9) |
| Consultation only | 27 (6.9) | 42 (11.1) |
The tissue specimens consisted of shave and punch biopsies, excisions, and re-excisions. If multiple specimens of the same lesion were available, for example, in case of an initial biopsy followed by a re-excision with lesion tissue remaining, the WSIs were grouped at the lesion level. All WSIs of unique, H&E-stained slides with lesion tissue present were included per lesion. Image acquisition was performed using a ScanScope XT scanner (Aperio, Vista, CA, USA) at 20× magnification with a resolution of 0.50 μm per pixel (slides scanned before 2016), a NanoZoomer 2.0-XR scanner (Hamamatsu photonics, Hamamatsu, Shizuoka, Japan) at 40× magnification with a resolution of 0.23 μm per pixel (slides scanned starting from 2016 to May 2022), and a NanoZoomer S360 scanner (Hamamatsu photonics, Hamamatsu, Shizuoka, Japan) at 40× magnification with a resolution of 0.23 μm per pixel (slides scanned after May 2022).
The dataset was randomly split at the patient level into a model development set of 580 lesions (75%) and a test set of 192 lesions (25%) for evaluation. The development set was further subdivided into five folds for cross-validation. To investigate model performance under variation in H&E staining, only lesions with WSIs of both internal and consultation slides were sampled for inclusion in the test set, while maintaining a prevalence of lesion (sub)types comparable to that of the development set.
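The patient-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `patient_level_split`, the mapping `lesion_to_patient`, and the round-robin fold assignment are assumptions, and the sketch leaves out both the stratification by lesion (sub)type and the restriction of the test set to lesions with internal and consultation WSIs.

```python
import random
from collections import defaultdict

def patient_level_split(lesion_to_patient, test_fraction=0.25, n_folds=5, seed=0):
    """Split lesions so that all lesions of one patient share a partition.

    lesion_to_patient: dict mapping a lesion ID to a patient ID (hypothetical input).
    Returns (test_lesions, folds), where folds is a list of n_folds lesion lists.
    """
    by_patient = defaultdict(list)
    for lesion, patient in lesion_to_patient.items():
        by_patient[patient].append(lesion)

    # Shuffle patients, not lesions, to avoid leakage between partitions.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_test = round(test_fraction * len(patients))
    test = [l for p in patients[:n_test] for l in by_patient[p]]

    # Distribute the remaining patients round-robin over the folds.
    folds = [[] for _ in range(n_folds)]
    for i, patient in enumerate(patients[n_test:]):
        folds[i % n_folds].extend(by_patient[patient])
    return test, folds
```

Splitting on patient IDs rather than lesion IDs keeps multiple lesions from the same patient out of both the development and test sets at once.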
Feature representation
Tissue cross-sections and pen markings were segmented in each WSI at 1.25× magnification using SlideSegmenter [10]. The resulting tissue segmentation map was used to guide the slide tessellation. Non-overlapping image tiles were extracted from the tissue regions of the WSIs at 20× magnification. Tiles mostly showing the uninformative background of the slide (i.e., with less than 5% of the tile area covered by tissue) and tiles showing pen markings were excluded. The remaining image tiles were converted into feature vectors, capturing the visual information in a compressed form to reduce the computational demands for analysis. Feature vectors were extracted for all tissue tiles using three different feature encoders: (1) the first stage of HIPT [11], producing 192-dimensional feature vectors for tiles of 256 × 256 pixels; (2) the second stage of HIPT, producing 384-dimensional feature vectors for tiles of 4096 × 4096 pixels; and (3) UNI [12], producing 1024-dimensional feature vectors for tiles of 224 × 224 pixels.
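The tile selection step can be illustrated with a small sketch that operates on binary masks. The function `tissue_tiles`, its arguments, and the rule that a single pen-marking pixel excludes a tile are assumptions made for illustration; the source only states that tiles with less than 5% tissue and tiles showing pen markings were excluded.

```python
def tissue_tiles(tissue, tile, pen=None, min_tissue=0.05):
    """Return (row, col) origins of non-overlapping tiles that pass the filters.

    tissue, pen: 2D lists of 0/1 values (1 = tissue / pen marking).
    tile: tile side length, in mask pixels.
    """
    kept = []
    for r in range(0, len(tissue) - tile + 1, tile):
        for c in range(0, len(tissue[0]) - tile + 1, tile):
            cells = [(i, j) for i in range(r, r + tile) for j in range(c, c + tile)]
            coverage = sum(tissue[i][j] for i, j in cells) / len(cells)
            # Assumption: any pen pixel inside the tile excludes it.
            marked = pen is not None and any(pen[i][j] for i, j in cells)
            if coverage >= min_tissue and not marked:
                kept.append((r, c))
    return kept
```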
Model training
AI models were trained for each of the three classification tasks using the three sets of extracted feature vectors. Across all combinations of the task and feature vector set, model training was repeated five times, each using a different fold for validation and the remaining four folds for training. Because the number of extracted feature vectors varies per case, only feature vectors from a single case were used per iteration (i.e., a batch size of one). The Vision Transformer (ViT) [13] (depth = 2, heads = 4, MLP-ratio = 4, embedding dimension = 192) was used as model architecture. The models were trained by minimizing the cross-entropy loss for 32,000 iterations starting from randomly initialized parameters using the AdamW [14] optimization algorithm (β₁ = 0.9, β₂ = 0.999). To counteract the class imbalance in the diagnostic category prediction task, which was not as severe for the other two tasks, the models were optimized with balancing class weights for this task. Gradients were accumulated over every 32 iterations. The learning rate was 5 × 10⁻⁵ at the start and halved after every 6400 iterations. The network parameters that resulted in the smallest loss on the validation fold were saved, which was evaluated after every 320 iterations. The models were trained with attention dropout (p = 0.5). In addition, feature vectors were randomly excluded during training as another form of dropout (p = 0.5). If the total number of features for a case exceeded the maximum of 25,000 feature vectors, a subset equal in size to the maximum was randomly selected. Hyperparameters were tuned based on the average performance on the five validation folds. The predicted probability threshold for the binary classification task was optimized based on the performance on the validation set for each model. For the classification tasks with more than two classes, the class with the largest predicted probability was considered to be the predicted class.
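Two of the training details above, the step-wise learning rate schedule and the per-case feature subsampling, can be sketched in plain Python. The function names are hypothetical, and the order of operations in `sample_features` (cap first, then dropout) and the rule of keeping at least one vector are assumptions, since the source does not specify them.

```python
import random

def learning_rate(iteration, base_lr=5e-5, halve_every=6400):
    """Learning rate that starts at base_lr and halves after every 6400 iterations."""
    return base_lr * 0.5 ** (iteration // halve_every)

def sample_features(features, rng, max_features=25_000, drop_p=0.5, training=True):
    """Cap the number of feature vectors per case and apply feature dropout.

    Assumption: the cap is applied before dropout, and at least one
    feature vector is always kept.
    """
    if len(features) > max_features:
        features = rng.sample(features, max_features)
    if training:
        kept = [f for f in features if rng.random() >= drop_p]
        features = kept or features[:1]
    return features
```

With 32,000 iterations and gradients accumulated over 32 iterations, the schedule would correspond to 1,000 parameter updates if one update is applied per accumulation window.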
The model as well as the training and evaluation procedure were implemented in the PyTorch [15] framework. The code and trained model parameters are made publicly available.1
Experimental setup
For the Spitz classification tasks, we compared three approaches: (1) logistic regression using clinical features only (i.e., age, sex, and anatomical location); (2) ViTs using image features only (based on the first and second stage of HIPT as well as UNI); and (3) logistic regression using the clinical features in combination with the image-based feature vector extracted before the final layer of the ViTs. Because some Spitz tumors harbor rare genetic aberrations, not enough cases were available to form a separate class for development and evaluation of the AI models, which is why the cases were grouped for classification into an ALK, ROS1, NTRK, and other class. This aligns well with the fact that the IHC stains for ALK, ROS1, and NTRK are also the most widely available and commonly used for Spitzoid lesions. Similarly, Spitz tumors with a differential diagnostic category were grouped with the more severe category for classification (i.e., benign/intermediate was grouped with intermediate and intermediate/malignant was grouped with malignant). The predicted probability for individual cases with more than the maximum of 25,000 feature vectors was considered to be the average of the predicted probabilities based on 10 randomly selected subsets of the maximum size. Probabilities predicted by the five model instances developed in the cross-validation were averaged to obtain model ensemble predictions. Model performance was measured in terms of the area under the receiver operating characteristic curve (AUROC) and accuracy on both the internal and consultation test set. The AUROC for multi-class classification tasks was computed per class using a one-versus-rest approach. Stratified bootstrapping (R = 10,000 samples) was used to calculate 95% confidence intervals (CIs) using the percentile method. A binomial test was used to statistically compare the accuracy of the best AI model to the expected accuracy when randomly guessing. 
A p value below 0.05 was considered statistically significant.
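The evaluation metrics can be sketched without external libraries. The block below computes the AUROC as the probability that a positive case receives a higher score than a negative case, and a stratified bootstrap confidence interval with the percentile method. The function names and the index convention for the percentiles are assumptions; for a multi-class task, the labels would first be binarized per class (one versus rest).

```python
import random

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_bootstrap_ci(labels, scores, metric, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile CI; each class is resampled with replacement at its original size."""
    rng = random.Random(seed)
    by_class = {}
    for y, s in zip(labels, scores):
        by_class.setdefault(y, []).append(s)

    stats = []
    for _ in range(n_boot):
        ys, ss = [], []
        for y, group in by_class.items():
            draw = rng.choices(group, k=len(group))
            ys.extend([y] * len(draw))
            ss.extend(draw)
        stats.append(metric(ys, ss))

    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Resampling within each class keeps the class proportions fixed in every bootstrap sample, so the AUROC is always defined.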
Reader study
We conducted a reader study to compare the performance of the best AI models with that of pathologists' assessment on the three classification tasks. We recruited two pathologists from different academic centers and two pathologists from non-academic centers, all of whom had five or more years of experience in dermatopathology. A stratified subset of 100 cases was randomly selected from the internal test set. The reader study was performed using the SlideScore platform,2 where the pathologists were provided with the most representative WSI per case and the corresponding clinical information. The participating pathologists were blinded to any additional diagnostic information (e.g., IHC-stained slides or findings from molecular analyses). The questions related to the genetic aberration and diagnostic category appeared in the user interface and could be answered only if the pathologist classified a case as a Spitz tumor. The order in which the cases were presented was randomized. For a fair comparison, we also evaluated the best AI model on the subset of selected cases using only the most representative WSI, in contrast to the earlier evaluation, in which all WSIs with tumor tissue present were provided. McNemar's exact test [16] was used for statistical comparison between the accuracy of each pathologist and the best AI models on the three tasks. The Bonferroni correction was applied to adjust the p values for multiplicity (four comparisons). Because the genetic aberrations and diagnostic categories were only predicted by the pathologists when a lesion was first identified as a Spitz tumor, the statistical comparison of the accuracy for these two tasks was limited to the subset of true Spitz tumors with pathologists' predictions available, which differed for each pathologist. The corresponding AI model predictions were selected for each subset to allow for paired comparisons.
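The paired comparison can be illustrated with a short sketch of the two-sided exact McNemar test followed by a Bonferroni adjustment. The function names are hypothetical, and the discordant counts in the usage note are invented for illustration; they are not taken from the study.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p value.

    b: cases where only the pathologist was correct.
    c: cases where only the AI model was correct.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bonferroni(p, n_comparisons=4):
    """Adjust a p value for multiple comparisons, capped at 1."""
    return min(1.0, p * n_comparisons)
```

For example, with hypothetical discordant counts of b = 0 and c = 10, `mcnemar_exact(0, 10)` gives 2/1024, and `bonferroni` multiplies this by four. Only the discordant pairs enter the test; cases where both readers agree carry no information about the difference in accuracy.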
Simulation experiment
A simulation experiment was conducted to investigate how implementing AI-based recommendations of ancillary diagnostic tests based on the predicted genetic background of Spitz tumors could affect the workflow of the pathology department. A flowchart of the simulated workflow variants is shown in Fig. 1. The typical workflow at the pathology department of the UMC Utrecht starts with performing the Spitz (i.e., ALK, ROS1, and NTRK) IHC stains, which are followed by molecular diagnostics if necessary. In the simulation, as soon as a positive IHC stain is identified, any remaining IHC stains and molecular diagnostics are no longer performed. Two baseline variants were defined, with IHC stains performed either in parallel or sequentially, ordered from high to low prevalence of the corresponding genetic aberration. The baselines were also expanded by incorporating AI-based recommendations. If the AI model classifies a lesion to be part of the class with other Spitz tumors (i.e., not ALK, ROS1, or NTRK-fused) with a predicted probability that exceeds the threshold T, IHC staining is skipped and molecular analysis is performed directly. In addition, the order of the sequential IHC stains can alternatively be based on the probabilities predicted by the AI model instead of the prevalence. To put the results of the best AI model for genetic aberration prediction into perspective, the simulation was also repeated with a hypothetical perfect AI-based recommendation system.
Fig. 1. Flowchart of the baseline and AI-incorporated workflow variants for the simulation experiment. In the baseline workflow, IHC staining is either performed in parallel or in sequence, ordered from high to low prevalence. In the workflow with AI-based recommendations based on the predicted probabilities for the genetic aberrations, IHC staining is either skipped or performed in parallel, in sequence, ordered from high to low prevalence, or in sequence, ordered from high to low predicted probability (abbreviated as pred. prob.).
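The recommendation rule for the sequential variant can be written as a small function. This is a sketch under stated assumptions: `plan_tests` and its dictionary input are hypothetical, the prevalence order is taken from the fusion percentages in Table 1, and the default threshold of 0.5 follows the value reported for the simulation.

```python
# Prevalence order from Table 1: NTRK (28.2%), ROS1 (27.0%), ALK (15.0%).
PREVALENCE_ORDER = ["NTRK", "ROS1", "ALK"]

def plan_tests(probs, threshold=0.5, order_by="prevalence"):
    """Return the ordered list of tests for one Spitz tumor.

    probs: predicted probabilities with keys "ALK", "ROS1", "NTRK", "other".
    The plan is only carried out until the first positive IHC stain.
    """
    if probs["other"] > threshold:
        return ["molecular"]  # skip IHC, go directly to molecular analysis
    if order_by == "pred_prob":
        stains = sorted(PREVALENCE_ORDER, key=lambda g: probs[g], reverse=True)
    else:
        stains = list(PREVALENCE_ORDER)
    return stains + ["molecular"]
```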
All simulated workflow variants were repeated for 10,000 iterations. Per iteration, 100 Spitz cases were randomly sampled with replacement from the test set, which approximately reflects the number of genetically confirmed Spitz cases diagnosed annually in the pathology department of the UMC Utrecht. The Spitz IHC stains were assumed to cost €100 each [17] and to require 1 day of processing time. Molecular diagnostics was assumed to cost €1000 [18,19] and to require 10 days of processing time. The assumed costs and turnaround times were based on our experience at UMC Utrecht and values reported in the literature, but may vary between centers. False-negative or ambiguous IHC stains are not uncommon in practice and were incorporated in the simulation. Based on the proportions in the complete dataset, the probabilities of an ALK, ROS1, and NTRK IHC stain being false negative or too ambiguous for definitive diagnosis in the simulation were 0.055, 0.448, and 0.255, respectively. Empirically, we found T = 0.5 to be a suitable threshold for the AI model we developed. The simulation results include the mean and 95% CI of the material cost accumulated over 100 cases, the average turnaround time per case, and the average number of examinations by a pathologist because of new diagnostic information per case (e.g., an initial examination of H&E-stained slides, followed by re-examination after the IHC-stained slides have been prepared, followed by another re-examination after the results of molecular analyses are available, equals three examinations in total).
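A simplified version of the two baseline workflows can be simulated as below. The costs, processing times, and false-negative probabilities are the values stated above; everything else is an assumption made for illustration, including the function names, the stain order, the convention that a parallel round of three stains takes one day and one re-examination, and the list of true classes passed in as `cases`. The AI-based variants are not included.

```python
import random

IHC_COST, IHC_DAYS = 100, 1      # per stain, as assumed in the text
MOL_COST, MOL_DAYS = 1000, 10    # molecular diagnostics, as assumed in the text
FALSE_NEG = {"ALK": 0.055, "ROS1": 0.448, "NTRK": 0.255}
ORDER = ["NTRK", "ROS1", "ALK"]  # assumed high-to-low prevalence order

def work_up(true_class, rng, sequential):
    """Return (cost, days, examinations) for one case in a baseline workflow."""
    cost, days, exams = 0, 0, 1  # one examination for the H&E slides
    rounds = [[s] for s in ORDER] if sequential else [ORDER]
    positive = False
    for stains in rounds:
        cost += IHC_COST * len(stains)
        days += IHC_DAYS
        exams += 1
        # A stain is positive only for the true fusion and if not a false negative.
        positive = any(s == true_class and rng.random() >= FALSE_NEG[s] for s in stains)
        if positive:
            break
    if not positive:  # fall back to molecular diagnostics
        cost, days, exams = cost + MOL_COST, days + MOL_DAYS, exams + 1
    return cost, days, exams

def simulate_cost(cases, sequential, n_iter=10_000, n_cases=100, seed=0):
    """Mean and 95% percentile interval of the cost accumulated over n_cases."""
    rng = random.Random(seed)
    totals = sorted(
        sum(work_up(c, rng, sequential)[0] for c in rng.choices(cases, k=n_cases))
        for _ in range(n_iter)
    )
    mean = sum(totals) / n_iter
    return mean, totals[int(0.025 * n_iter)], totals[int(0.975 * n_iter) - 1]
```

In this sketch, a tumor from the "other" class always ends in molecular diagnostics, costing €1300 in either baseline but taking 11 days in the parallel variant and 13 days in the sequential one.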
Results
Spitz tumor versus conventional melanoma prediction
The test set results of the prediction models for distinguishing Spitz tumors from conventional melanomas are shown in Table 2. The logistic regression model based only on clinical features achieved an AUROC of 0.80 (95% CI, 0.74–0.86) and an accuracy of 0.74 (95% CI, 0.68–0.80). In comparison, all AI models based only on image-extracted features performed better than the clinical model. Using the second-stage features of HIPT resulted in slightly higher performance scores than using the features after the first stage of HIPT. The best performance was obtained by the AI model based on the features extracted using UNI with an AUROC of 0.95 (95% CI, 0.92–0.98) and an accuracy of 0.86 (95% CI, 0.81–0.91) using the internal WSIs, which was statistically significantly different (p < 0.001) from the expected accuracy of 0.50 for random predictions. The classification results based on the UNI features are shown in more detail in the confusion matrices in Fig. 2A and B. Out of the seven Spitz tumors incorrectly classified by the model as conventional melanomas, three were benign Spitz nevi, three were Spitz melanocytomas, and one was a Spitz melanoma. Overall, the performance was slightly better when evaluated on the internal WSIs than on the consultation WSIs. Combining the best image-extracted features with the clinical features resulted in comparable performance.

Table 2. Results for the Spitz tumor versus conventional melanoma prediction on the test set.

| Features | Feature extractor | Internal WSIs: AUROC (95% CI) | Internal WSIs: Acc. (95% CI) | Consultation WSIs: AUROC (95% CI) | Consultation WSIs: Acc. (95% CI) |
|---|---|---|---|---|---|
| Clinical only | – | 0.80 (0.74–0.86) | 0.74 (0.68–0.80) | 0.80 (0.74–0.86) | 0.74 (0.68–0.80) |
| Image only | HIPT (stage 1) | 0.84 (0.78–0.90) | 0.77 (0.71–0.83) | 0.82 (0.76–0.87) | 0.75 (0.69–0.80) |
| Image only | HIPT (stage 2) | 0.87 (0.82–0.92) | 0.79 (0.73–0.84) | 0.85 (0.79–0.90) | 0.74 (0.68–0.80) |
| Image only | UNI | 0.95 (0.92–0.98) | 0.86 (0.81–0.91) | 0.93 (0.90–0.96) | 0.85 (0.80–0.90) |
| Clinical & Image | UNI | 0.95 (0.92–0.98) | 0.86 (0.81–0.91) | 0.94 (0.91–0.97) | 0.85 (0.80–0.90) |

Acc. = Accuracy.

Fig. 2. Confusion matrices with classification results for (A, B) Spitz tumor versus conventional melanoma prediction, (C, D) Spitz genetic aberration prediction, and (E, F) Spitz diagnostic category prediction. Results are shown for the ensemble of five UNI-based ViT models on the internal WSIs (top row) and consultation WSIs (bottom row) of the test set.
Several example cases with corresponding attention maps and classification results for one of the five UNI features-based models in the ensemble are shown in Fig. 3. The attention maps highlight the importance of each tile for the case-level prediction by way of the model-assigned weight. Tiles that were assigned the highest attention weight consistently showed the melanocytic lesion, primarily the dermal component, for both correct and incorrect predictions. Moreover, in conventional melanoma cases with a pre-existing nevus, the nevus tiles often received high attention weights (see the center column of Fig. 3A). Among the conventional melanomas incorrectly predicted to be Spitz tumors, cells with an epithelioid appearance were frequently seen and highlighted in the attention maps (see the rightmost column of Fig. 3A). No consistent patterns were observed for the Spitz tumors incorrectly predicted to be conventional melanomas.
Fig. 3. Example cases from the test set. Per case from top to bottom: tissue cross-sections from the most representative whole-slide image for that case, the tiles extracted from the cross-sections (excluding pen markings) colored based on the attention weights assigned by the AI model, the tile with the largest attention weight at a higher magnification, and the classification result. Classification decisions were obtained using the best threshold based on the validation fold. (A) Predictions for conventional melanoma (CM) cases. (B) Predictions for Spitz tumor cases.
Spitz genetic aberration prediction
The best results for the prediction of the genetic aberrations in Spitz tumors were achieved using features extracted with UNI and are shown in Table 3. The AI model reached an accuracy of 0.55 (95% CI, 0.46–0.64) and AUROCs ranging from 0.76 to 0.86 for the different genetic aberrations based on the internal WSIs, with slightly worse performance on the consultation WSIs. The confusion matrices in Fig. 2C and D show that the best classification results were obtained for the other Spitz tumors and the worst for the ALK-fused Spitz tumors. The AI models trained using the features extracted with the first and second stage of HIPT were both outperformed by the UNI-based model (see Supplementary Tables 2 and 3). For comparison, random predictions would approximately yield an accuracy of 0.25 and AUROCs of 0.50. The difference between the accuracy of the best AI model and the accuracy when randomly guessing is statistically significant (p < 0.001). The clinical logistic regression model did not exceed random chance-level performance (see Supplementary Table 1).

Table 3. Results for the Spitz genetic aberration prediction on the test set using the image-only AI model based on features extracted with UNI.

| Metric | Classes | Internal WSIs | Consultation WSIs |
|---|---|---|---|
| Accuracy (95% CI) | ALK, ROS1, NTRK, other | 0.55 (0.46–0.64) | 0.51 (0.41–0.60) |
| AUROC (95% CI) | ALK vs. rest | 0.79 (0.67–0.89) | 0.71 (0.56–0.84) |
| AUROC (95% CI) | ROS1 vs. rest | 0.76 (0.66–0.85) | 0.77 (0.68–0.86) |
| AUROC (95% CI) | NTRK vs. rest | 0.81 (0.77–0.89) | 0.77 (0.68–0.85) |
| AUROC (95% CI) | Other vs. rest | 0.86 (0.76–0.94) | 0.81 (0.71–0.91) |

vs. = versus.
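The comparison against chance-level accuracy can be illustrated with an exact binomial test. The sketch assumes a one-sided alternative (accuracy greater than chance), which the source does not specify, and the counts in the usage note are hypothetical rather than the study's actual numbers.

```python
from math import comb

def binomial_p_greater(successes, n, p0):
    """One-sided exact p value: P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )
```

As a hypothetical example, 55 correct predictions out of 100 cases with a chance level of 0.25 for four classes gives `binomial_p_greater(55, 100, 0.25)`, which is far below 0.001.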
Visual inspection of the attention maps for correctly and incorrectly classified cases revealed some patterns. For example, Spitz tumors predicted to harbor an NTRK fusion regularly displayed epithelioid melanocytes in combination with pigmentation and inflammatory cells on the tiles that were assigned the largest attention weight. Cases predicted to belong to the class with other Spitz tumors frequently showed melanocytes with strong variation in cell size and pronounced nuclear atypia on these tiles. The most important tile for ALK fusion-predicted Spitz tumors occasionally showed spindled melanocytes. It must be noted, however, that these patterns were not consistently observed across all lesions of a predicted subtype, and no clear resemblance was seen between the highest attention tiles for lesions classified to harbor a ROS1 fusion.
Spitz diagnostic category prediction
The best results for the diagnostic category prediction of Spitz tumors were achieved using features extracted with UNI, as shown in Table 4. Evaluated on the internal test set WSIs, the AI model reached an accuracy of 0.51 (95% CI, 0.40–0.60) and AUROCs of 0.62, 0.57, and 0.74 in distinguishing benign, intermediate, and malignant Spitz tumors from the rest, respectively. The classification results are reported in the confusion matrices in Fig. 2E and F, showing that the model only rarely predicted Spitz lesions to most likely belong to the malignant category. In contrast to the previous two prediction tasks, the difference in performance between the image feature encoders is smaller (see Supplementary Tables 5 and 6) and the performance difference between the internal and consultation WSIs is less clear-cut. Random predictions would approximately yield an accuracy of 0.33 and AUROCs of 0.50. The difference between the accuracy of the best AI model and the accuracy when guessing randomly is statistically significant (p < 0.001). Similar to the genetic aberration prediction task, the clinical logistic regression model did not exceed the performance level of random guessing (see Supplementary Table 4).

Table 4. Results for the Spitz diagnostic category prediction on the test set using the image-only AI model based on features extracted with UNI.

| Metric | Classes | Internal WSIs | Consultation WSIs |
|---|---|---|---|
| Accuracy (95% CI) | Benign, intermediate, malignant | 0.51 (0.40–0.60) | 0.52 (0.41–0.62) |
| AUROC (95% CI) | Benign vs. rest | 0.62 (0.51–0.73) | 0.65 (0.54–0.76) |
| AUROC (95% CI) | Intermediate vs. rest | 0.57 (0.45–0.69) | 0.62 (0.51–0.73) |
| AUROC (95% CI) | Malignant vs. rest | 0.74 (0.56–0.89) | 0.71 (0.54–0.86) |

vs. = versus.
Reader study
The results of the reader study, comparing the performance of four pathologists experienced in dermatopathology to that of the best image-only AI models, are shown in Table 5. For each of the three classification tasks, the AI model reached a higher accuracy than the four pathologists. For the first task of distinguishing Spitz tumors from conventional melanomas, the mean accuracy of the pathologists was 0.77, and the accuracy of the AI model was 0.89, with a statistically significant difference between one of the pathologists and the AI model. Several example cases where either most pathologists, the AI model, or both were incorrect are provided in the Supplementary Material (see Supplementary Figs. 1–6). For the second task of predicting the genetic aberration of Spitz tumors, the mean accuracy of the pathologists and the accuracy of the AI model were 0.35 and 0.52, respectively, with no statistically significant differences in the individual comparisons. For the third task of predicting the diagnostic category of Spitz tumors, the pathologists achieved a mean accuracy of 0.36, whereas the AI model achieved an accuracy of 0.54, with no statistically significant differences in the individual comparisons.

Table 5. Results of the reader study on a randomly selected, stratified subset of the test set comparing the performance of four pathologists to that of the best image-only AI model across three tasks: (1) distinguishing Spitz tumors from conventional melanomas, (2) predicting the genetic aberration of Spitz tumors, and (3) predicting the diagnostic category of Spitz tumors. The p values were obtained using McNemar's exact test with Bonferroni correction and pertain to the individual comparisons with the AI model for the corresponding task.

| Reader | Spitz tumor vs. conv. melanoma: N | Accuracy (95% CI) | p value | Genetic aberration: N | Accuracy (95% CI) | p value | Diagnostic category: N | Accuracy (95% CI) | p value |
|---|---|---|---|---|---|---|---|---|---|
| Pathologist 1 | 100 | 0.81 (0.73–0.88) | 0.39 | 43 | 0.28 (0.16–0.40) | 0.05 | 43 | 0.35 (0.21–0.49) | >0.99 |
| Pathologist 2 | 100 | 0.79 (0.72–0.86) | 0.17 | 50 | 0.44 (0.30–0.58) | >0.99 | 50 | 0.40 (0.36–0.46) | >0.99 |
| Pathologist 3 | 100 | 0.71 (0.63–0.79) | 0.002 | 46 | 0.33 (0.20–0.46) | 0.17 | 46 | 0.33 (0.24–0.41) | 0.54 |
| Pathologist 4 | 100 | 0.76 (0.67–0.84) | 0.10 | 39 | 0.36 (0.23–0.49) | 0.95 | 39 | 0.36 (0.26–0.46) | >0.99 |
| AI models | 100 | 0.89 (0.83–0.95) | – | 50 | 0.52 (0.42–0.62) | – | 50 | 0.54 (0.40–0.66) | – |

vs. = versus, conv. = conventional.
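As a quick arithmetic check, the mean pathologist accuracies quoted in the text can be approximately recovered from the per-reader values in Table 5. The sketch works in percentage points to avoid floating-point rounding issues, and it takes an unweighted mean of the rounded table values, which is an assumption about how the reported means were computed.

```python
# Per-pathologist accuracies from Table 5, in percentage points.
accuracy_pct = {
    "spitz_vs_melanoma": [81, 79, 71, 76],
    "genetic_aberration": [28, 44, 33, 36],
    "diagnostic_category": [35, 40, 33, 36],
}
# Mean accuracies stated in the text, in percentage points.
reported_mean_pct = {
    "spitz_vs_melanoma": 77,
    "genetic_aberration": 35,
    "diagnostic_category": 36,
}

mean_pct = {task: sum(v) / len(v) for task, v in accuracy_pct.items()}
for task, mean in mean_pct.items():
    # Agreement within half a percentage point is consistent with rounding.
    print(task, mean, abs(mean - reported_mean_pct[task]) <= 0.5)
```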
Simulation experiment
The results of the simulation experiment are shown in Table 6. Performing the Spitz IHC stains sequentially, compared to performing them in parallel, has a lower accumulated material cost, whereas the average turnaround time and the number of examinations are higher. Adopting AI-based recommendations, both by skipping IHC staining for Spitz tumors predicted to harbor a different genetic aberration (i.e., not ALK, ROS1, or NTRK-fused) and by performing the sequential IHC stains ordered based on the predicted probability instead of the prevalence, improved the efficiency over the baseline approaches. More specifically, for the parallel IHC staining variant, the material cost accumulated over 100 cases decreased by €2671 (3.5%), the average turnaround time decreased by 0.17 days (3.0%), and the average number of examinations decreased by 0.18 (7.3%). For the variant with sequential IHC staining, the material cost accumulated over 100 cases decreased by €3996 (5.6%), the average turnaround time decreased by 0.40 days (6.0%), and the average number of examinations decreased by 0.76 (19.6%). Further improvements were observed for both variants across all three metrics using the hypothetical perfect AI-based recommendations in the workflow.
Table 6. Results for the simulation experiment. A baseline, AI-based recommendation, and hypothetical perfect AI-based recommendation workflow were compared using parallel and sequential immunohistochemistry (IHC) assessment. The performance was measured in terms of the material cost accumulated over 100 cases, the average turnaround time per case, and the average number of examinations per case. The colors range from red to green for the maximum to minimum value per metric. Pred. prob. = Predicted probability.
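The absolute and relative reductions quoted above imply approximate baseline values, which can be recovered with a one-line calculation. Because the percentages are rounded, the results are rough estimates rather than the values in Table 6; the helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(absolute_change, relative_change):
    """Baseline value implied by an absolute change and its relative size."""
    return absolute_change / relative_change

# Material cost accumulated over 100 cases (euros).
parallel_cost = implied_baseline(2671, 0.035)    # roughly 76,000
sequential_cost = implied_baseline(3996, 0.056)  # roughly 71,000

# Average number of examinations per case.
parallel_exams = implied_baseline(0.18, 0.073)
sequential_exams = implied_baseline(0.76, 0.196)
```

These estimates are consistent with the statement that the sequential baseline has a lower material cost but a higher number of examinations than the parallel baseline.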
Discussion and conclusion
In this study, we investigated the extent to which an AI model can accurately distinguish Spitz tumors from conventional melanomas and predict the underlying genetic aberration and diagnostic category of Spitz tumors. We conducted a reader study to compare the predictive performance of AI models with that of four pathologists on these tasks. Additionally, to better understand how AI-based recommendations for ancillary diagnostic testing could affect the workflow of the pathology department, we performed a simulation experiment.
The best AI model correctly distinguished most Spitz tumors from conventional melanomas, as evidenced by an AUROC of 0.95 and an accuracy of 0.86 on the test set. The classification performance varied between feature extraction models, with the second stage of HIPT performing better than the first stage, whereas both were outperformed by UNI. These findings align with previously reported results for classification tasks in other pathology domains.11,20,21 Our results showed that a logistic regression model based solely on age, sex, and anatomical location performed reasonably well; however, using these clinical features in combination with the best image-based prediction model did not improve performance. This is noteworthy, as pathologists typically rely heavily on clinical information when diagnosing Spitzoid lesions. Slightly lower performance was observed in the evaluation based on the consultation WSIs, which can likely be attributed to the variation in tissue appearance due to differences in preparation and staining protocols between centers.22 Moreover, the presence of nevus cells is a relevant histological feature for diagnosis, as these cells are regularly seen together with conventional melanomas (i.e., in the form of a pre-existing nevus), whereas a nevus next to a Spitz tumor is very uncommon.23 The attention visualization suggests that the AI model has also learned to recognize this characteristic (see the center column of Fig. 3A). Many of the conventional melanomas incorrectly predicted to be Spitz tumors showed an epithelioid cellular appearance, which is also often seen in true Spitz tumors. No consistent patterns were observed among the Spitz tumors incorrectly predicted to be conventional melanomas.
For predicting the genetic aberration, the best AI model reached a classification performance significantly above chance level, with an accuracy of 0.55, where random predictions would yield 0.25. Visual inspection of the tiles with the highest attention weights revealed some patterns consistent with characteristics described in case studies of Spitz tumors with specific genetic aberrations,24–29 although interpretation remained challenging. Incorporating positional embeddings could further improve classification performance by enabling the AI model to also capture the lesion morphology at lower magnification.
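A comparison against chance level can be illustrated with an exact one-sided binomial test. The sketch below uses the AI model's reader-study result for genetic aberration prediction (an accuracy of 0.52 on 50 cases, i.e., 26 correct) and the chance level of 0.25. Whether this particular test was used in the study is not stated here, so it should be read as one possible check; the function name is hypothetical.

```python
from math import comb

def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), i.e., an exact one-sided p value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_cases = 50         # reader-study cases for genetic aberration prediction
n_correct = 26       # 0.52 * 50 correct predictions
chance_level = 0.25  # accuracy expected from random predictions

p_value = binomial_tail(n_correct, n_cases, chance_level)
print(f"one-sided p value vs. chance: {p_value:.2e}")
```

The test treats every case as an independent trial with the same chance-level success probability, which is a simplification when the classes are not perfectly balanced.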
The diagnostic category prediction was the most challenging task, as the best AI model achieved an accuracy of 0.51, compared to 0.33 for random guessing. Moreover, the malignant category was rarely predicted to be the most probable class, despite the use of balancing class weights during training. To reach a diagnostic category for Spitz tumors in clinical practice, pathologists need to integrate histological, IHC, and genetic features, without strict criteria for which feature combinations constitute a Spitz nevus, melanocytoma, or melanoma.1 Although agreement between experts improved with the availability of genetic information in the study of Benton et al.,3 disagreement remained in a considerable number of cases, illustrating the difficulty of diagnosing Spitzoid lesions. This diagnostic uncertainty may have also affected the model development and evaluation in this study. Nevertheless, the limited predictive performance is likely primarily due to the absence of histological characteristics that correlate with the genomic background and diagnostic category.
The reader study showed that the AI model for each of the three Spitz classification tasks reached a higher accuracy than the four pathologists with experience in dermatopathology, although the difference in accuracy was not statistically significant for most individual comparisons. It is important to note that pathologists in clinical practice typically rely on IHC stains and molecular diagnostics to differentiate Spitz tumors from conventional melanomas, and to determine the underlying genetic aberration and diagnostic category. Most pathologists are therefore not used to performing these tasks without additional diagnostic information. Other factors that could have affected the pathologists' assessment include: (1) the cases were randomly selected with stratification to obtain mostly balanced classes for each of the three tasks, which ensured adequate representation of rare classes for evaluation purposes, but also resulted in class distributions that deviated from the real-world prevalences (e.g., Spitz melanomas are much rarer than Spitz nevi); (2) the pathologists performed all three tasks at once, whereas separate AI models were trained for the respective tasks; and (3) the WSI appearance and viewing application likely differed from the routine setup of the pathologists.
Through a simulation experiment, we studied how implementing AI models for predicting genetic aberrations might impact the workflow of the pathology department. Although the accuracy is currently not high enough to serve as a replacement for IHC staining or molecular analyses, we demonstrated that AI-based recommendations on the selection of ancillary diagnostic tests can potentially improve workflow efficiency by reducing the total material cost, the turnaround times, and the number of examinations. Although the genetic background of Spitz tumors can also be predicted by pathologists and does not necessarily require an AI model, this task is challenging, as seen in the reader study, and is not routinely performed in clinical practice at the moment. The AI model could, therefore, serve as a tool for pathologists to reach the correct diagnosis faster while reducing costs. The scope of the simulation experiment was limited to Spitz tumors, which does not completely reflect clinical practice, where melanocytic lesions can also be other subtypes, but does show how efficiency gains could be achieved while keeping the simulation complexity manageable. Further improvement of the predictive accuracy would yield larger gains in the direction of the hypothetical perfect AI model. Extending this approach in future work to other relevant IHC stains (e.g., BRAF, BAP1, and β-catenin) by incorporating additional melanocytic lesion subtypes could improve the representativeness of the simulation and may lead to larger benefits.30 In addition, simulation can be useful before starting model development to investigate the level of accuracy required for the expected savings to justify the costs of AI model implementation.
Despite this being the largest study into AI-based classification of Spitz tumors, the dataset size remains a limiting factor, and improvements in model performance may be possible after training on more data. Additionally, only Spitz tumor or conventional melanoma cases confirmed by a positive IHC stain for a Spitz marker and/or molecular analysis were included in the study cohort. This inclusion criterion has likely introduced some form of selection bias, as conventional melanomas are not always genetically characterized in routine practice, nor do all harbor a BRAF or NRAS mutation. Improvements in molecular diagnostic equipment have also enabled the identification of more Spitz subtypes over time. In combination with the specialized caseload as a consultation center, this could have resulted in prevalences that differ from those in the general population.
In conclusion, the AI model achieved a strong predictive performance in distinguishing Spitz tumors from conventional melanomas. On the more challenging tasks of predicting the genetic aberration and the diagnostic category of Spitz tumors, the AI models performed better than random chance. The potential benefits of implementing AI-based recommendations for ancillary diagnostic testing were demonstrated using a simulation experiment.
CRediT authorship contribution statement
R.L., M.V., and W.B. conceptualized the study. R.L., M.R., C.E., A.N., N.S., G.B., A.J., and W.B. participated in data curation and verification. R.L. and M.V. designed the methodology. R.L. developed the AI models and performed the model evaluation. A.M., D.H., L.W., and S.R. participated in the reader study. R.L., M.R., and N.S. performed the reader study evaluation. R.L., M.R., M.V., and W.B. analyzed and interpreted the results. R.L. wrote the original draft. M.V. and W.B. supervised the project and participated in funding acquisition. All authors had full access to all the data in the study. All authors read, edited, and approved the final manuscript. All authors accept the final responsibility to submit for publication and take responsibility for the contents of the manuscript.
Ethics approval and consent to participate
The study does not fall within the scope of the Dutch Medical Research Involving Human Subjects Act (WMO) and therefore does not require approval from an accredited medical ethics committee in the Netherlands. Nevertheless, an independent quality assessment (25U-0162) was conducted at the UMC Utrecht to ensure compliance with relevant laws and regulations, including those related to the informed consent procedure, data management, privacy, and legal considerations. The need for informed consent was waived due to the cohort size and retrospective nature of the study.
Declaration of competing interest
The authors declare no competing interests.
References
- 1. WHO Classification of Tumours Editorial Board. WHO Classification of Tumours Series: Skin Tumours. 5th ed. 2023. International Agency for Research on Cancer, Lyon (France) [Internet; beta version ahead of print]. Available from https://tumourclassification.iarc.who.int/chapters/64 [Accessed on 10 July, 2025].
- 2. Harms K.L., Lowe L., Fullen D.R., Harms P.W. Atypical Spitz tumors: a diagnostic challenge. Arch Pathol Lab Med. 2015;139:1263–1270. doi:10.5858/arpa.2015-0207-RA. PMID: 26414472.
- 3. Benton S., Zhao J., Zhang B. Impact of next-generation sequencing on interobserver agreement and diagnosis of Spitzoid neoplasms. Am J Surg Pathol. 2021;45:1597–1605. doi:10.1097/PAS.0000000000001753. PMID: 34757982.
- 4. Bastian B.C. The molecular pathology of melanoma: an integrated taxonomy of melanocytic neoplasia. Annu Rev Pathol. 2014;9:239–271. doi:10.1146/annurev-pathol-012513-104658. PMID: 24460190. PMCID: PMC4831647.
- 5. Song A.H., Jaume G., Williamson D.F. Artificial intelligence for digital and computational pathology. Nat Rev Bioeng. 2023;1:930–949.
- 6. Shmatko A., Ghaffari Laleh N., Gerstung M., Kather J.N. Artificial intelligence in histopathology: enhancing cancer research and clinical oncology. Nat Cancer. 2022;3:1026–1038. doi:10.1038/s43018-022-00436-4. PMID: 36138135.
- 7. Hart S.N., Flotte W., Andrew F. Classification of melanocytic lesions in selected and whole-slide images via convolutional neural networks. J Pathol Inform. 2019;10:5. doi:10.4103/jpi.jpi_32_18. PMID: 30972224. PMCID: PMC6415523.
- 8. Snyder A.N., Zhang D., Dreesen S.L. Histologic screening of malignant melanoma, Spitz, dermal and junctional melanocytic nevi using a deep learning model. Am J Dermatopathol. 2022;44:650–657. doi:10.1097/DAD.0000000000002232. PMID: 35925282.
