An Explainable AI Exploration of the Machine Learning Classification of Neoplastic Intracerebral Hemorrhage from Non-Contrast CT
Sophia Schulze-Weddige, Georg Lukas Baumgärtner, Tobias Orth, Anna Tietze, Michael Scheel, David Wasilewski, Mike P. Wattjes, Uta Hanning, Helge Kniep, Tobias Penzkofer, Jawed Nawabi

TL;DR
This study uses explainable AI to understand how a machine learning model distinguishes between cancer-related and non-cancer-related brain hemorrhages using CT scans.
Contribution
The study introduces a novel application of explainable AI to analyze the decision-making of a deep-learning model in classifying neoplastic intracerebral hemorrhage.
Findings
The model relies more on features within the hemorrhage than in surrounding edema.
ICH importance was on average 30% higher than PHE importance in model predictions.
Significant differences in ICH importance were observed between neoplastic and non-neoplastic cases.
Abstract
This study investigates which imaging features a deep-learning model uses to distinguish between neoplastic and non-neoplastic brain hemorrhages. Explainable artificial intelligence techniques show that the model relies primarily on features in the hemorrhage, but also considers features in the surrounding edema. Objective: To understand the importance of different imaging features in the automatic classification of neoplastic and non-neoplastic intracerebral hemorrhage (ICH) using admission CT. Methods: This study builds on a previously published machine learning model for the classification of neoplastic vs. non-neoplastic ICH. In the current work, we analyzed its decision process with explainable AI methods. We compared the average importance of ICH and perihematomal edema (PHE) in the model’s predictions to gain insight into its decision process regarding the etiology…
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging · Intracerebral and Subarachnoid Hemorrhage Research · Brain Tumor Detection and Classification
1. Introduction
Intracerebral hemorrhage (ICH) associated with primary and metastatic brain tumors presents a significant challenge in neuro-oncology due to the substantial risk of complications [1]. A major contributor to this challenge is the diagnostic complexity, particularly during the early stages of presentation [2]. Patients with tumor-related ICH often exhibit symptoms resembling, among others, those of spontaneous hypertensive hemorrhages, which can frequently serve as the initial clinical manifestation preceding tumor-specific symptoms [2,3,4,5]. This similarity can make differentiation based on clinical and imaging findings difficult, potentially delaying the initiation of etiology-specific work-up protocols. These delays may not only impact prognosis and therapeutic outcomes but also result in unnecessary diagnostic procedures, raising concerns about both clinical and economic efficiency [6,7,8,9,10]. Accurate and early detection of neoplastic ICH is, therefore, essential.
In most cases, computed tomography (CT) imaging remains the gold standard, particularly as these patients often present in acute clinical settings. Recent studies have proposed various quantitative approaches that leverage characteristics of the perihematomal edema (PHE) surrounding the hemorrhagic lesion to differentiate neoplastic from non-neoplastic hemorrhages [11,12,13,14]. Notably, in our previous work, we introduced an end-to-end deep learning approach with significant potential for clinical translation [15]. This method combines an automated segmentation model to delineate lesions of interest with a classification model to distinguish neoplastic from non-neoplastic ICH, thereby eliminating the need for manual segmentation.
In the current study, we do not modify or retrain this classification model. Instead, we focus on analyzing how the previously trained model arrives at its predictions. Despite its strong performance, the model—like many deep neural networks—functions as a “black box,” making it difficult to interpret and trust in clinical settings. To address this challenge, we apply post hoc explainable artificial intelligence (XAI) techniques to enhance the transparency and interpretability of the model’s decision-making process [16].
Our hypothesis is that feature attribution methods, which provide pixel-wise significance scores to highlight critical regions, will offer valuable insights into the model’s inner workings and reaffirm the predictive importance of the PHE region in distinguishing between neoplastic and non-neoplastic ICH. To test this hypothesis, we compared the average importance attributed to ICH and PHE regions in the classification process.
2. Methods
2.1. Study Population
Our study consisted of two retrospectively assembled patient cohorts. The first cohort was gathered from Charité University Hospital Berlin, Germany, from January 2016 to May 2020. The second cohort included patients from a further academic hospital, the University Medical Center Hamburg-Eppendorf, Germany, from January 2010 to December 2017. For the purpose of this study, these two cohorts were pooled together to create a larger and more diverse dataset for the evaluation of the explainability methods. The inclusion criteria were consistent across both cohorts, requiring patients to have an ICH diagnosis on CT imaging, followed by MRI imaging. Cases were categorized into non-neoplastic and neoplastic ICH based on the H-ATOMIC classification [17]. Patient characteristics can be found in Table 1. Illustrative cases for neoplastic and non-neoplastic cases can be found in Figure 1.
2.2. Image Analysis and Preprocessing
Non-contrast CT images were retrieved from the local picture archiving and communication system (PACS) servers, anonymized in line with local protocols, and converted to Neuroimaging Informatics Technology Initiative (NIfTI) format. Semi-manual planimetric measurements quantified the extent of ICH and PHE. For both cohorts, this analysis was performed by a trained research student or radiology resident with three years of experience in ICH imaging. Additionally, all cases were reviewed by a radiology fellow with eight years of experience in ICH imaging. In case of discrepancies, a consensus reading was performed, as detailed in our previous studies [13,14]. The readers were blinded to the clinical data and radiological reports during image analysis. Segmentations were used to mask the CT images, and the images were cropped from the original size of (512, 512, 31) to (200, 200, 20). This simplification substantially reduced the classification task’s complexity, allowing the model to concentrate on relevant image regions.
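The masking and cropping step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the text does not say how the crop window is positioned, so the sketch centres it on the lesion's bounding box, and the function name `mask_and_crop` and the label convention (0 = background, > 0 = lesion) are hypothetical.

```python
import numpy as np

def mask_and_crop(ct, seg, out_shape=(200, 200, 20)):
    """Zero out voxels outside the segmentation, then crop a fixed-size box.

    ct  : 3D array of CT intensities, e.g. shape (512, 512, 31)
    seg : 3D integer array of the same shape; 0 = background, > 0 = lesion
    Assumption (not from the paper): the crop is centred on the lesion's
    bounding box and shifted inward if it would leave the image.
    """
    if ct.shape != seg.shape:
        raise ValueError("ct and seg must have the same shape")

    # Background masking: everything outside ICH/PHE becomes zero.
    masked = np.where(seg > 0, ct, 0)

    coords = np.argwhere(seg > 0)
    if coords.size == 0:
        raise ValueError("segmentation is empty")
    center = (coords.min(axis=0) + coords.max(axis=0)) // 2

    slices = []
    for c, size, full in zip(center, out_shape, ct.shape):
        if size > full:
            raise ValueError("crop is larger than the image")
        # Clip the start index so the window stays inside the volume.
        start = int(np.clip(c - size // 2, 0, full - size))
        slices.append(slice(start, start + size))
    return masked[tuple(slices)]

# Synthetic example with the image sizes quoted in the text.
ct = np.full((512, 512, 31), 40.0, dtype=np.float32)
seg = np.zeros(ct.shape, dtype=np.uint8)
seg[240:270, 250:290, 12:18] = 1
print(mask_and_crop(ct, seg).shape)  # (200, 200, 20)
```

A lesion larger than the crop window would be truncated by this sketch; how such cases were handled is not described in the text.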
2.3. Classification Pipeline
In this study, we used XAI to compare the importance of different imaging features in the machine learning-based classification of neoplastic and non-neoplastic ICH, which is based on a previously trained and externally validated classification model [15]. In the original work, a residual neural network (ResNet) model was trained on preprocessed images for the classification of neoplastic and non-neoplastic ICH. The preprocessing entailed a segmentation of the ICH and PHE regions. This segmentation was automated with an nnU-Net segmentation model. The two models were integrated in an end-to-end pipeline in which the automatically generated segmentations were used to preprocess the images for the classification task. A graphical representation of the workflow is shown in Figure 2. The classification model yielded an area under the curve (AUC) of 83% with an accuracy of 80%, sensitivity of 72%, and specificity of 89% on the full study population. Details about the model training can be found in the Supplementary Material or in the original publication [15].
2.4. Explanation Methods
Explanations were generated for 349 cases (144 neoplastic, 205 non-neoplastic). Various established explanation methods have been applied, namely Saliency [18], InputXGradient [19], SmoothGrad [20], Gradient Shap [21], GradCam [22], Guided GradCam [22], and GradCam++ [23]. All employed methods are primary attribution methods determining the importance of individual input features on the output. Each method returns an attribution map the same size as the input image. Each pixel value in the attribution map corresponds to the importance of that pixel for the prediction. The explanations are local, meaning they do not explain the model behavior in general but indicate what was important for the prediction of the specific input instance.
The methods visually differ in the granularity and smoothness of the highlighted regions. Saliency and InputXGradient are first-order gradient-based attribution methods that calculate the gradients of the output with respect to the input. They typically result in sharp, sometimes noisy attribution maps. SmoothGrad and Gradient Shap offer more stable and smooth attribution maps that are less sensitive to noise in the input. SmoothGrad achieves this by adding noise to the input image and averaging over the resulting attribution maps. Gradient Shap integrates over multiple baselines to estimate feature importance, combining ideas from Shapley values with gradient-based methods. The GradCam variants (GradCam, Guided GradCam, GradCam++) focus more on high-level features and broader areas of importance rather than individual pixels. GradCam++ considers positive and negative gradients separately, which helps in preserving more precise localization information and generating sharper heatmaps.
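To make the difference between the first-order methods and SmoothGrad concrete, the following minimal sketch applies them to a toy model with an analytic gradient. It is not the study's implementation: `grad_fn` is a hypothetical stand-in for a backward pass through the network, the toy model and all parameter values are illustrative, and taking the absolute gradient for Saliency is a common convention assumed here.

```python
import numpy as np

def smoothgrad(grad_fn, x, n_samples=25, noise_std=0.1, seed=0):
    """Average input gradients over noisy copies of x (SmoothGrad idea).

    grad_fn : callable returning d(output)/d(input), same shape as x
              (hypothetical stand-in for a network backward pass)
    """
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, noise_std, size=x.shape)
        total += grad_fn(noisy)
    return total / n_samples

# Toy model f(x) = sum(w * x**2), so the gradient is 2 * w * x.
w = np.array([[1.0, -2.0], [0.5, 3.0]])

def grad_fn(x):
    return 2.0 * w * x

x = np.array([[1.0, 1.0], [2.0, -1.0]])

saliency = np.abs(grad_fn(x))      # Saliency: magnitude of the gradient
input_x_grad = x * grad_fn(x)      # InputXGradient: input times gradient
smooth = smoothgrad(grad_fn, x)    # SmoothGrad: gradient averaged over noise
```

All three maps have the same shape as the input, which matches the description of the attribution maps above.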
A comprehensive description of the methods and their differences has been provided in the Supplementary Material. Figure 3 shows an exemplary case for each method. The mean importance of the ICH and PHE region was calculated by averaging the importance of all pixels belonging to that region.
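The region-wise averaging described above can be sketched as follows. The label values for background, ICH, and PHE and the function name `region_mean_importance` are assumptions made for illustration; the text does not specify how the segmentation labels are encoded.

```python
import numpy as np

# Hypothetical label convention for the segmentation mask.
BG, ICH, PHE = 0, 1, 2

def region_mean_importance(attribution, labels):
    """Mean attribution over all pixels of each region.

    attribution : array of per-pixel importance scores
    labels      : integer array of the same shape with values BG, ICH, PHE
    """
    if attribution.shape != labels.shape:
        raise ValueError("attribution and labels must have the same shape")
    means = {}
    for name, value in (("BG", BG), ("ICH", ICH), ("PHE", PHE)):
        region = labels == value
        # A region that is absent from the mask has no defined mean.
        means[name] = float(attribution[region].mean()) if region.any() else float("nan")
    return means

attribution = np.array([[0.0, 0.8], [0.6, 0.4]])
labels = np.array([[BG, ICH], [ICH, PHE]])
means = region_mean_importance(attribution, labels)
```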
2.5. Faithfulness Metric
The attribution methods were quantitatively evaluated using the faithfulness metric, which measures the relevance of selected features for the model’s prediction [24]. The metric is obtained by calculating the correlation coefficient between the attribution value of each pixel and the change in prediction probability when the pixel is replaced by a value from a baseline image. This baseline represents the absence of information and is usually challenging to determine. Common baseline values include zero, the average value of the image, or a random value sampled from the pixel distribution. In this study, a baseline of zeros was chosen, consistent with the background masking applied during preprocessing.
Iterating through all pixels of the image, a prediction is made on the image with this pixel value set to zero. The model’s prediction probability for the target class is observed. In case the pixel was important for the prediction, a drop in prediction probability is expected. A high correlation between the changes in prediction probability and the attribution values indicates that the attribution map reflects the model’s decision-making process well.
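The procedure above can be sketched as a short loop. This is a minimal illustration, assuming a Pearson correlation and a callable `predict_proba` that returns the target-class probability for one image; both the callable and the toy linear model are hypothetical and stand in for the trained classifier.

```python
import numpy as np

def faithfulness(predict_proba, image, attribution, baseline=0.0):
    """Correlation between pixel attributions and prediction drops.

    predict_proba : callable mapping an image to the target-class probability
                    (hypothetical stand-in for the trained classifier)
    baseline      : replacement value for a removed pixel (zero in the study)
    """
    base_prob = predict_proba(image)
    flat_attr = attribution.ravel()
    drops = np.empty(flat_attr.size)
    for i in range(flat_attr.size):
        perturbed = image.copy()
        perturbed.flat[i] = baseline              # remove one pixel
        drops[i] = base_prob - predict_proba(perturbed)
    # Pearson correlation between attribution and observed drop.
    return float(np.corrcoef(flat_attr, drops)[0, 1])

# Toy model whose output is linear in the input, so w * x is an exact attribution.
w = np.array([[0.1, 0.0], [-0.05, 0.2]])
image = np.array([[1.0, 2.0], [1.0, 1.0]])

def predict_proba(img):
    return 0.5 + float((w * img).sum())

score_good = faithfulness(predict_proba, image, w * image)    # close to 1
score_bad = faithfulness(predict_proba, image, -(w * image))  # close to -1
```

The loop needs one forward pass per pixel, which for a (200, 200, 20) input is 800,000 passes per case, so a practical implementation would likely batch the perturbed inputs or evaluate a subset of pixels.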
The metric produces a correlation coefficient with possible values ranging from −1 to 1. Positive values indicate a positive linear relation between the variables. In this case, it means that with increasing attribution values, the drop in prediction probability caused by removing the pixel also increases, which is interpreted as indicating a good explanation. Negative values indicate a negative linear relation between the variables, meaning that as the attribution value increases, the drop in prediction probability becomes smaller. A value of 0 means there is no linear relation between the two variables. Absolute values of 0.1, 0.3, and 0.5 typically represent small, medium, and high correlations, respectively, which facilitates the evaluation of the explanatory quality. However, there is no established threshold for a “good” explanation. In this study, we used the faithfulness metric to benchmark the explanation methods and identify the most faithful one for our model, without the need for a definitive threshold. The scores of each method are provided in Table 2 and may serve as a reference for future research or comparisons.
2.6. Statistical Analysis
Data were tested for normality with the Shapiro–Wilk test. If the assumption of normality was met, variables were compared with a two-sided t-test; otherwise, the Mann–Whitney U test was used. All two-sided hypothesis tests were considered statistically significant at p < 0.05. The average importance of the ICH region was compared with the average importance of the PHE region. Further, the average ICH and PHE importances were compared between the neoplastic and non-neoplastic cases. Additionally, this comparison was performed for small and large lesions separately, with the median lesion volume separating the two subgroups.
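The test-selection logic and the median split can be sketched with SciPy. Several details are assumptions, since the text does not state them: normality is required in both groups before the t-test is used, the default Student t-test of `scipy.stats.ttest_ind` is applied, and cases with a volume equal to the median are assigned to the small-lesion subgroup. The helper names are hypothetical.

```python
import numpy as np
from scipy import stats

def compare_groups(a, b, alpha=0.05):
    """Two-sided comparison: t-test if both groups look normal, else Mann-Whitney U."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Assumption: normality must hold in both groups (Shapiro-Wilk p > alpha).
    normal = stats.shapiro(a)[1] > alpha and stats.shapiro(b)[1] > alpha
    if normal:
        name, result = "t-test", stats.ttest_ind(a, b)
    else:
        name, result = "Mann-Whitney U", stats.mannwhitneyu(a, b, alternative="two-sided")
    p_value = float(result.pvalue)
    return name, p_value, p_value < alpha

def split_by_median(values, volumes):
    """Split values into small- and large-lesion subgroups at the median volume."""
    values = np.asarray(values, dtype=float)
    volumes = np.asarray(volumes, dtype=float)
    small = volumes <= np.median(volumes)  # assumption: ties go to the small group
    return values[small], values[~small]
```

In use, `compare_groups` would be called once on the full population and once on each subgroup returned by `split_by_median`, for example on the per-case mean ICH importance of neoplastic versus non-neoplastic cases.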
3. Results
We generated and compared explanations for 349 cases (144 neoplastic, 205 non-neoplastic). The median ICH volume was 6.92 mL (IQR 2.21–19.91). There was no significant difference in ICH volume between the two classes (see Table 1). The distribution of ICH volumes for both classes is visualized in a histogram plot in Figure 4.
The explanation method with the highest faithfulness score was GradCam++, with an average of 0.49 and a standard deviation of 0.15. Hence, the attribution maps from GradCam++ were used to compare the average importance of ICH and PHE in neoplastic and non-neoplastic cases. The scores for all explanation methods can be found in Table 2. The overall mean importance was 0.639 for ICH and 0.435 for PHE, compared to 0.014 for the background (BG). Separated by class, the mean importance for ICH was 0.663 in non-neoplastic cases and 0.615 in neoplastic cases, whereas for PHE, the values were 0.439 and 0.430, respectively, as detailed in Table 3. The distribution of importance scores is illustrated in a violin plot in Figure 5.
The statistical analysis of the full study population, as well as for the small and large lesions separately, revealed significant differences: (1) between the mean importance of ICH and PHE (all p < 0.001), (2) between the mean importance of ICH in the neoplastic and non-neoplastic group (all p < 0.01), and (3) between the mean BG importance in the neoplastic and non-neoplastic group (all p < 0.001), as detailed in Table 3. No significant difference was found in the mean importance of PHE between the two groups for the full study population (p = 0.54). However, when separated by lesion volume, there was a significant difference. In large lesions, the average PHE importance was higher in neoplastic cases compared to non-neoplastic cases (p = 0.001). The opposite was true for the small lesions, in which the average PHE importance was lower for the neoplastic cases (p = 0.02).
4. Discussion
The predictions of a convolutional neural network (CNN) for the binary classification of neoplastic and non-neoplastic ICH have been explained with the GradCam++ attribution method to gain insights into the inner working of the model and to confirm the importance of PHE for the classification task. An early and accurate differentiation of ICH types is important, as it enables timely and appropriate treatment decisions, which are essential for improving survival rates and reducing long-term disability. Gaining insights into the model by analyzing the average importance of ICH and PHE regions is important because it helps validate the model’s decision-making process and ensures that it aligns with clinical knowledge.
By understanding how the model uses these regions to differentiate between neoplastic and non-neoplastic ICH, we can confirm that the model is focusing on relevant anatomical features, increasing our confidence in its predictions. This transparency enhances the trustworthiness of the model in clinical practice, as it demonstrates that the AI model is making decisions based on meaningful and clinically significant information.
The generated explanations showed that both PHE and ICH were important for the differentiation of neoplastic and non-neoplastic ICH on admission CT. Yet, the ICH region consistently showed a higher average importance than the PHE region in both classes, irrespective of lesion volume. This suggests that the model’s differentiation between the two classes relied less on variations in PHE volume than initially hypothesized. We acknowledge that this finding appears to contrast with previous work, including our own earlier studies that demonstrated the diagnostic value of PHE volume in distinguishing neoplastic from non-neoplastic ICH [12,14]. This highlights important differences between traditional feature-based approaches and deep learning models. Specifically, while prior studies analyzed manually extracted imaging metrics (e.g., absolute and relative PHE volume), the deep learning model used here may prioritize more subtle textural and density cues embedded within the ICH region itself. One possible mechanistic explanation lies in the lower density of neoplastic ICH on CT images compared to non-neoplastic ICH. The lower density likely results from intermixed tumor tissue and from the slower bleeding of tumors compared to the abrupt vessel ruptures in hypertension-associated hemorrhages [11]. This points to density as a key predictive factor in classifying ICH etiology, which is supported by the significant differences in importance between neoplastic and non-neoplastic cases for ICH (p < 0.001). Our findings did not corroborate the anticipated higher discriminatory power of PHE in the automated classification. Although PHE contributed to the classification process, its importance did not exceed that of ICH.
Interestingly, a significant difference in PHE importance emerged when separating the cohort into lesions smaller and larger than the median ICH volume of 6.92 mL. Our results demonstrated that PHE importance was higher in neoplastic cases only for large lesions, whereas for smaller lesions, PHE importance was lower compared to non-neoplastic cases. This observation underscores the pathophysiological and temporal differences in edema formation between tumor-related and spontaneous hemorrhagic lesions, offering insights into the discriminative capabilities of our XAI approach.
Larger neoplastic ICHs are likely associated with a longer duration of tumor growth, which could lead to a larger preexisting vasogenic PHE. Vasogenic edema is driven by the disruption of the blood–brain barrier, permitting the accumulation of protein-rich fluid in the extracellular space [25]. This phenomenon is particularly prominent in larger tumors, which are associated with greater angiogenesis and vascular permeability [26,27]. Additionally, larger tumors exert a more substantial mass effect, further exacerbating blood–brain barrier disruption and enhancing edema formation [28]. Thus, the pronounced importance of PHE in larger neoplastic lesions detected by our model likely reflects these underlying tumor-related mechanisms.
In contrast, smaller neoplastic hemorrhages may not have had sufficient time or tumor activity to generate significant vasogenic edema. Consequently, the PHE observed in such cases might predominantly result from the acute hemorrhagic insult itself. Early PHE formation within the first four hours post-hemorrhage is primarily osmotic in nature, driven by clot retraction and the release of plasma proteins, rather than tumor-specific mechanisms [29]. This early osmotic edema is pathophysiologically distinct from vasogenic edema, lacking the prolonged, barrier-disruptive processes associated with tumor growth [29].
These findings are consistent with prior studies suggesting that larger tumor size correlates with more extensive vasogenic edema due to enhanced angiogenic activity and chronic blood–brain barrier disruption [26,27]. The observed dependency of PHE importance on lesion size in our XAI model reinforces its utility in capturing these nuanced pathophysiological differences. This underscores the potential of explainable AI to not only enhance diagnostic accuracy but also provide insights into the underlying biological mechanisms.
Lastly, there is a significant difference in average background importance between the neoplastic and non-neoplastic groups. This finding can be observed in the full study population and in the subgroups of different ICH volumes. The Grad-CAM++ method’s smoothing effect during scaling extends attribution scores slightly beyond the lesion border into adjacent background areas. It seems that the model is paying more attention to the edges of the lesion in neoplastic cases, which leads to higher average importance in the background. This interpretation is consistent with known imaging characteristics of intra-axial brain tumors where “radial or finger-like” extensions or irregular shapes of the PHE can suggest neoplastic etiologies [30]. This visual observation aligns with clinical understanding and provides a plausible explanation for why the background area might show importance in distinguishing between neoplastic and non-neoplastic cases.
This study has some limitations. First, the analysis is performed on a single classifier. Expanding the analysis to multiple classifiers would allow us to assess whether the patterns observed here generalize across different models or if alternative classifiers might focus on different imaging features. For this, new classifiers would have to be developed and tested first. In addition, potential clinical confounders such as tumor histology (e.g., primary vs. metastatic origin) and time from symptom onset to imaging were not included in the analysis. These variables may influence the extent and appearance of perihematomal edema (PHE), potentially affecting model interpretation. However, our model was intentionally designed as a radiology-based decision-support tool that operates on admission non-contrast CT alone, reflecting real-world scenarios in which clinical data may be unavailable or unreliable. In particular, the timing of symptom onset is often unclear in patients with neoplastic disease, especially in older adults. While incorporating such clinical variables could enhance diagnostic specificity, it may also limit model applicability. Future work should explore multimodal approaches that integrate imaging with clinical and temporal data to further refine model performance and interpretability.
Second, our preprocessing approach masks the image background, effectively forcing the model to focus solely on the regions of interest. From an XAI perspective, it would be interesting to see whether a classifier also considers other regions as relevant to its decision-making. However, our preprocessing was specifically designed to enhance model performance, enabling it to focus on clinically meaningful areas and leverage the capabilities of deep learning-based segmentation. This approach aligns with the goal of maximizing predictive accuracy but prevents an analysis of other regions.
5. Conclusions
Our study demonstrates that our previously introduced deep learning model effectively uses both PHE and ICH regions to discern neoplastic from non-neoplastic ICH on admission CT. This suggests that the model’s diagnostic process is grounded in relevant image features rather than incidental associations. Further, the results underscore the ICH region’s predominant influence on the model’s differentiation of ICH types.
References
- 1. Burth S., Ohmann M., Kronsteiner D., Kieser M., Löw S., Riedemann L., Laible M., Berberich A., Drüschler K., Rizos T. Prophylactic anticoagulation in patients with glioblastoma or brain metastases and atrial fibrillation: An increased risk for intracranial hemorrhage? J. Neuro-Oncol. 2021, 152, 483–490. doi:10.1007/s11060-021-03716-8. PMID: 33674992. PMCID: PMC8084835.
- 2. Ostrowski R.P., He Z., Pucko E.B., Matyja E. Hemorrhage in brain tumor—An unresolved issue. Brain Hemorrhages 2022, 3, 98–102. doi:10.1016/j.hest.2022.01.005.
- 3. Choi G., Park D.-H., Kang S.-H., Chung Y.-G. Glioma mimicking a hypertensive intracerebral hemorrhage. J. Korean Neurosurg. Soc. 2013, 54, 125–127. doi:10.3340/jkns.2013.54.2.125. PMID: 24175027. PMCID: PMC3809438.
- 4. Eminovic S., Orth T., Dell’Orco A., Baumgärtner L., Morotti A., Wasilewski D., Guelen M.S., Scheel M., Penzkofer T., Nawabi J. Clinical and imaging manifestations of intracerebral hemorrhage in brain tumors and metastatic lesions: A comprehensive overview. J. Neuro-Oncol. 2024, 170, 567–578. doi:10.1007/s11060-024-04811-2. PMID: 39222188. PMCID: PMC11614960.
- 5. Singla N., Aggarwal A., Vyas S., Sanghvi A., Salunke P., Garg R. Glioblastoma Multiforme with Hemorrhage Mimicking an Aneurysm: Lessons Learnt. Ann. Neurosci. 2016, 23, 263–265. doi:10.1159/000449488. PMID: 27780994. PMCID: PMC5075732.
- 6. Baiguissova D., Laghi A., Rakhimbekova A., Fakhradiyev I., Mukhamejanova A., Battalova G., Tanabayeva S., Zharmenov S., Saliev T., Kausova G. An economic impact of incorrect referrals for MRI and CT scans: A retrospective analysis. Health Sci. Rep. 2023, 6, e1102. doi:10.1002/hsr2.1102. PMID: 36923371. PMCID: PMC10009910.
- 7. Haddadi S., Dehghani M., D’Amato G. Editorial: Delay in cancer diagnosis and factors affecting outcomes. Front. Public Health 2024, 12, 1442764. doi:10.3389/fpubh.2024.1442764. PMID: 39071154. PMCID: PMC11272649.
- 8. Khanmohammadi S., Mobarakabadi M., Mohebi F. The Economic Burden of Malignant Brain Tumors. Adv. Exp. Med. Biol. 2023, 1394, 209–221. doi:10.1007/978-3-031-14732-6_13. PMID: 36587390.
