Machine learning based classification of aggressive and malignant renal tumors from multimodal data
Mehrnegar Aminy, Tejal Gala, Agnimitra Dasgupta, Serena Li, Steven Y. Cen, S. J. Pawan, Inderbir Gill, Vinay Duddalwar, Assad A. Oberai

TL;DR
This study uses machine learning with CT scans and clinical data to classify kidney tumors as benign, malignant-indolent, or malignant-aggressive, improving non-invasive diagnosis and treatment planning.
Contribution
A novel machine learning pipeline that classifies renal tumors using multimodal data and identifies tumor aggressiveness for better clinical decision-making.
Findings
The best models achieved an AUC of 0.90 for distinguishing aggressive from indolent tumors.
Tumor size significantly improved classification accuracy when combined with CT and clinical data.
Random forest and multi-layer perceptron models showed complementary strengths in different classification tasks.
Abstract
This study aimed to develop and evaluate a machine learning pipeline using multiphase contrast-enhanced CT images and clinical data to classify renal tumors as benign, malignant-indolent, or malignant-aggressive, while assessing the contribution of each data source to the classification. In this retrospective study, 448 patients (mean age: 60.7 ± 12.6 years, 306 male, 142 female) who underwent nephrectomy and preoperative CECT between June 2008 and July 2018 were included. Tumors were histologically categorized as benign-indolent, malignant-indolent, or malignant-aggressive. Self-supervised feature extraction converted 4-phase CECT images into 512 real-valued features, combined with clinical data and tumor size for classification. Two machine learning classifiers, random forest (RF) and multi-layer perceptron (MLP), were used to predict tumor type. Nested five-fold cross-validation was…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Fig 1
Fig 2
Fig 3
Fig 4
Fig 5Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRenal cell carcinoma treatment · MRI in cancer diagnosis · Radiomics and Machine Learning in Medical Imaging
Introduction
Kidney cancer is among the top 10 most prevalent cancers in both men and women in the United States and ranks 14th globally. According to the American Cancer Society, approximately 81,610 new cases and 14,390 deaths were projected in the United States in 2024 alone. Prognosis varies widely by stage, with 5-year survival rates of 93% for localized disease, 74% for regional spread, and just 17% for metastatic disease.
Renal tumors present a significant clinical challenge as they are frequently asymptomatic and discovered incidentally during imaging studies [1–4]. The increased utilization and availability of cross-sectional imaging, such as computed tomography and magnetic resonance imaging, have led to stage migration, with renal masses now being identified at smaller sizes and earlier stages before symptoms suggest advanced disease [1,2,4,5]. Contrast-enhanced CT (CECT) has become the cornerstone of renal mass evaluation, providing critical information about tumor morphology and enhancement patterns [2,6]. While most incidentally-detected renal masses are simple cysts, benign entities, such as angiomyolipomas and oncocytomas, they must be distinguished from malignant renal cell carcinomas (RCCs), which include clear cell, papillary, and chromophobe subtypes [2,5,7–10]. RCC is the most lethal urologic malignancy, with over 15% of patients presenting with metastatic disease at diagnosis [2,11,12]. Conventional diagnosis relies on visual interpretation of imaging, which is subject to limitations; accurate qualitative analysis of CECT images is hindered by interobserver variability, intra- and intertumoral heterogeneity, and challenges in comparing contrast differences across imaging phases [2,8–10].
While tumor size has been shown to correlate with the likelihood of malignancy, size alone, as well as benignity and malignancy categorization, are insufficient to guide modern clinical decision-making [3,13–16]. For example, small renal masses (SRMs), defined as tumors ≤4 cm, now account for up to 66% of new RCC diagnoses, and they are generally categorized into three groups: benign tumors, aggressive cancers, and indolent cancers [4]. Although 70–80% of SRMs are malignant, many exhibit indolent behavior over time, as demonstrated in active surveillance series [15,17]. Consequently, not all malignant SRMs require immediate treatment, and indolent malignant tumors can often be managed with active surveillance [4,15,18,19]. Management options for SRMs, including surgical resection, thermal ablation, and active surveillance, must be carefully chosen to avoid overtreatment of indolent tumors or undertreatment of aggressive ones [4,18,19]. Bhindi et al. proposed a framework that stratifies renal masses into histologically indolent and aggressive subgroups based on radiographic size and gender, offering a foundation for individualized treatment decisions [15]. However, distinguishing between aggressive and indolent disease remains challenging, even when imaging findings are supplemented with renal mass biopsy data [4,15,20].
Improved prognostic tools are needed to enhance risk stratification and provide tailored treatments based on oncologic risk and overall health [4,15,21]. Advances in machine learning and artificial intelligence offer opportunities to leverage radiographic and clinical data for more precise tumor characterization and behavior prediction [22]. While several studies use these tools to classify renal tumors as malignant or benign [23–30], few address their classification as aggressive or indolent [31,32]. In the present study, we fill this gap and quantify the extent to which various data sources, including CECT images, clinical markers, and tumor size contribute to this classification.
In this study, we develop a novel classification pipeline to identify renal tumor aggressiveness using multiphase CECT images and clinical features. Self-supervised feature extraction transforms the CECT images into real-valued features [33], which are then combined with tumor size and clinical data in tabular form for classification using a multi-layer perceptron (MLP) and a random forest (RF) algorithm. Tumors are categorized as benign and indolent, malignant and indolent, or malignant and aggressive. These classifications determine whether a tumor is benign or malignant and, if malignant, whether it exhibits indolent or aggressive behavior.
Materials and methods
The overall classification pipeline is presented in Fig 1. The input comprises 4-phase CECT images, clinical features, and tumor size measured using the CECT images. In the first step, images are transformed to real-valued features of dimension 512 using self-supervised feature extraction (Sim-CLR). Once this is accomplished, all input data is in tabular form and can be concatenated and used in a classifier. Two types of classifiers, one based on a multi-layer perceptron (MLP) and the other on a random forest (RF) algorithm are considered. The effect of dimensionality reduction in the input features is also considered. Different permutations of the input data features are used in the MLP and RF classifiers, and each tumor is classified into one of three categories: benign and indolent, malignant and indolent, and malignant and aggressive. Finally, these classifications are combined to determine whether a given tumor is benign or malignant and indolent or aggressive.
Schematic representation of the pipeline.
Patient data
This study was approved by the University of Southern California Institutional Review Board (IRB) under approval number HS-16–00479. Formal written consent was obtained from patients for participation in a kidney mass data collection project, which has been ongoing for over two decades. This study retrospectively included 448 patients with renal masses at Keck Medical Center of USC who underwent partial or radical nephrectomy and preoperative 4-phase CECT of the abdomen and pelvis between June 2008 and July 2018 (Table 1). Post-resection, standard-of-care pathologic evaluation was performed by experienced genitourinary pathologists. The subjects used in this study overlap with those of previous studies [10,34].
Table 1: Description of study population.
Image data was deidentified in MATLAB by assigning anonymous study IDs and stored on a secure, password-protected server. Using Synapse 3D (Fujifilm) and blinded to pathologic diagnoses, two senior radiologists-in-training manually segmented renal tumors as three-dimensional regions of interest [34,35]. Images were resampled to a median spacing using nearest-neighbor interpolation, min-max normalized, and center cropped. For each phase, the axial slice passing through the centroid was extracted from the 3D image to create a 2D image with 256x256 pixels. Tumors were segmented by three radiology residents with different levels of experience (PGY-2, PGY-3, and PGY-5) and processed to yield shape metrics. Thereafter, ICC3,1 (two-way mixed with absolute agreement and single measurement) was performed on these features. The average ICC across all features was found to be 0.97, which is very high. More details are reported in [35]. A radiologist with 20 years of abdominal imaging experience verified segmentation accuracy.
Tumor size was determined as the largest diameter of the tumor across the imaging plane, measured in the nephrographic phase. Clinical features include demographic data, past medical history, and symptomatic data. The demographic data includes body mass index (BMI), age, and sex; past medical history includes the status for diabetes mellitus (DM), hypertension (HTN), and smoking measured in pack-years. Symptomatic data include gross hematuria, microhematuria, fatigue, weight loss, bleeding, pathologic fracture, constipation, and flank mass. Integer- and real-valued data are normalized to lie between (0, 1), and features like DM and HTN status are represented using binary encoding.
Samples were defined as benign, malignant and indolent, or malignant and aggressive based on histological classification as outlined by Bhindi et al. [15] and shown in S1 Table. Benign renal masses include oncocytoma, papillary and metanephric adenomas, and non-epithelioid angiomyolipoma. Malignant indolent tumors encompass low-grade clear-cell renal cell carcinoma (ccRCC) and papillary RCC, chromophobe RCC, clear-cell papillary RCC, mucinous tubular and spindle cell RCC, succinyl dehydrogenase-deficient RCC, epithelioid angiomyolipoma, and tubulocystic RCC regardless of grade. Tumors were defined as aggressive if they exhibit sarcomatoid differentiation or coagulative necrosis, except in cases of low-grade (International Society of Urological Pathology grade 1–2) papillary RCC. Malignant aggressive tumors include high-grade ccRCC and papillary RCC, collecting duct RCC, translocation-associated RCC, hereditary leiomyomatosis RCC, unclassified RCC, and other malignant non-RCC tumors. They also included all tumors with stage greater than T1a and size larger than 4 cm. Note that this definition was driven by downstream patient management tasks, viz., tumors that are likely to be managed by active surveillance were precluded from the malignant aggressive category. We recognize that this stratification may not be universal and may be subject to change in future studies. For the aggressive versus indolent classification task, all malignant and aggressive samples were included in the aggressive class, and all other samples were included in the indolent class. Whereas, for the malignant versus benign classification task, all malignant and indolent and malignant and aggressive samples were included in the malignant class, and all benign samples were included in the benign class. These classifications are described schematically in Fig 2.
Schematic showing how histopathology-based ground truth is used to define two binary classification tasks.
Feature extraction
Images were mapped into real-valued features which were then treated in the same way as other tabular data. This was accomplished by converting each 256x256x4 CECT image into a vector of dimension 512 using a contrastive learning approach called SimCLR. In this approach, a class of similar images is created by applying transformations like rotation, scaling, and blurring to a given image, and a class of contrastive images is created by using two distinct images from the data set. Thereafter, a classifier is trained to distinguish between images in the two classes, and the last-but-one layer (latent layer) of the classifier is retained as a low-dimensional feature space for the image. Once the classifier is trained, any new image can be passed through this network, and the value of the latent layer is recovered as a feature embedding of that image. This approach is commonly referred to as SimCLR [33], and it has been shown to outperform other techniques for extracting features from images [33]. For details, please refer to S1 Appendix.
Dimension reduction
The SimCLR features represent a dimension reduction approach that transforms 256x256x4 = 262,144 values for each image to 512 features. Further reduction was achieved by applying principal component analysis (PCA) to the SimCLR features to identify the informative directions in our dataset. It was observed that the first 20 principal components accounted for 98% of the total variance in the data and were therefore used as a low-dimensional representation of each image.
The number of clinical features was reduced from 20 to 9 by applying the feature importance technique within the RF classifier to find the most important clinical features.
Classification
Any tumor can be uniquely categorized into one of three categories: “benign and indolent”, “malignant and indolent” and “malignant and aggressive”. Two types of classifiers were developed for this three-class classification problem: random forests (RFs) and multi-layer perceptrons (MLPs). The Gini index was used to train the RF network and cross-entropy loss was used for the MLP. The network architecture for all the models is relatively simple and is described in S2 Appendix. For the MLP models, a learning rate of 3e-4 and L2 regularization factor of 8e-4 were used and the parameters were initialized using the default PyTorch initialization.
The classifier outputs three logits corresponding to the three histopathological categories: benign, malignant and indolent, and malignant and aggressive. These logits were used to derive the probabilities corresponding to the two classification tasks as follows. For the malignant versus benign classification, the probability of malignancy was computed as P(malignant) = 1 − P(benign). For the aggressive versus indolent classification, the probability of aggressiveness was computed as P(aggressive) = P(malignant and aggressive).
Nested five-fold cross-validation was used for hyperparameter tuning and model evaluation, and the area under the curve (AUC) for the receiver operating characteristic curve was reported for each model (see S3 Appendix for details).
Results
Results for all models are presented in Table 2 and Fig 3. Models 1–3 use a single source of input data; models 4 and 5 use clinical features and tumor size; models 6–9 use all sources of input data.
Table 2: AUC and 95% confidence interval (CI) for different classification models. Models are characterized by their inputs and the type of classifier.
Receiver Operating Characteristic (ROC) curves for Aggressive vs. Indolent and Malignant vs. Benign classifications for all models.
Model 1 is an MLP that uses image embeddings to perform the classification. Its performance was found to be acceptable for both aggressive versus indolent and malignant versus benign classifications with AUCs of 0.86 and 0.73, respectively. Model 2 uses only clinical features as input and underperformed for both types of classifications.
Model 2 was used to determine the relative feature importance values for the clinical features, and these results are shown in Fig 4. Thereafter, Model 3 was trained using the nine most important clinical features and its performance was found to be comparable to that of Model 2.
Relative feature importance for clinical data as determined by the random forest model.BMI = Body Mass Index, PPY = Pack Year, DM = Diabetes Mellitus, HTN = Hypertension.
In Model 4, clinical features and the tumor size determined from CT images were used as input and it was observed that its performance for aggressive vs indolent classification was significantly better than models that only used clinical features (Models 2 and 3). However, its performance in classifying malignant vs benign lesions was not as good as Model 1, which used images as input. Model 5 was designed to be similar to Model 4; however, it used the reduced set of 9 clinical features. It was observed that its performance was similar that of Model 4.
Models 6 and 7 used all sources of input data and differed from each other in the type of classifier. Model 6 used an RF classifier, while Model 7 used an MLP classifier. It was observed that the RF model (Model 6) performed better on the aggressive versus indolent classification task, while the MLP model (Model 7) performed better on the malignant versus benign classification task.
Models 8 and 9 used the first 20 principal components of the image embeddings as input. Working with this reduced set did not significantly affect the performance of the models and the change in the AUC between Models 6 and 8 and Models 7 and 9 was less than or equal to 0.02 points.
Precision–recall AUC (PR–AUC) values are reported in Tables 3 and 4. For the malignant vs. benign classification, the positive class (malignant) is dominant and therefore the PR–AUC for this classification is generally larger than that for the aggressive vs. indolent classification, where the positive class (aggressive) is rare. For the aggressive vs. indolent classification, the highest PR–AUC of 0.82 is obtained by Model 8, whereas for the malignant vs. benign classification, the highest PR-AUC of 0.90 is obtained by Model 9.
Table 3: Precision-recall AUC, components and threshold-based performance metrics (Sensitivity, Specificity, Precision, and F1-score) for each model in the Aggressive vs Indolent classification.
Table 4: Precision-recall AUC, components and threshold-based performance metrics (Sensitivity, Specificity, Precision, and F1-score) for each model in the Malignant vs Benign classification).
Confusion matrices and thresholded metrics are reported in Tables 1 and 2. The Youden J metric [36] was computed for each validation set and used to determine the optimal threshold. This value was then averaged over all validation sets, and the resulting value was used as a threshold for each classifier. The confusion matrix and the values of sensitivity, specificity, precision, and F1-score obtained using this threshold are reported in Tables 3 and 4.
Model calibration was evaluated by comparing predicted probabilities with the observed outcome frequencies. Predicted probabilities were divided into ten equally spaced bins between 0 and 1. For each bin, the mean predicted probability and the corresponding observed proportion of positive cases were computed. Nonparametric bootstrapping was performed within each bin to estimate the uncertainty around the observed proportions. Specifically, for each bin, the true labels were resampled with replacement 1,000 times, and the 95% confidence intervals of the resulting bootstrap distribution were calculated. The calibration plot (Fig 5) displays the mean predicted probability versus the observed frequency, with error bars representing the 95% confidence intervals. Wider intervals typically occur in bins with fewer samples. Calibration performance was also quantified using the Brier score [37] for each model and classification, as summarized in Table 5. Lower Brier scores indicate better-calibrated predictions, with values ranging from 0.13 to 0.23 across models and classification tasks. Overall, these analyses provide confidence that the trained models are generally well calibrated.
Table 5: Brier scores for each model for Aggressive vs. Indolent and Malignant vs. Benign classifications. Brier scores quantify the accuracy of predicted probabilities, providing a measure of model calibration, with lower values indicating better-calibrated predictions.
Calibration plots with 95% bootstrap confidence intervals for all nine classification models.The dashed diagonal line indicates perfect calibration, where predicted and observed probabilities are equal. Wider intervals in some bins reflect the smaller number of samples in those bins.
Statistical analysis
In Table 6, we report the difference in the AUC for the aggressive versus indolent classification between pairs of models and the 95% confidence interval (CI) for this difference. The value reported in the (i, j) cell of this matrix is equal to AUC (Model j) - AUC (Model i), and the corresponding 95% CI is derived from Delong’s Z test. Thus, the model considered in column j is superior to the model in row i if this value is positive. Further, this assertion is statistically significant when the 95% CI does not include zero. For example, Model 5 is statistically superior to Model 3 since the difference in the AUC is 0.33, and the 95% CI of this difference does not include zero.
Table 6: Difference in the AUC between models for classifying lesions as aggressive or indolent. For column j, and row i, each cell in the table reports the difference between the AUC of Model i and Model j. It also contains (in parenthesis) the 95% confidence interval (CI) of this difference.
From Table 6, we observe that Models 5, 6, and 8 are statistically superior to all other models for the aggressive versus indolent classification. We also observe that the difference among these three models is very small, and we may claim that their performance is equivalent.
In Table 7, we report the corresponding values for the difference in the AUC for the malignant versus benign classification. We conclude that Model 9 is statistically better than all models except Models 6 and 7. The 95% CI of the difference in the AUC between Model 9 and Model 6 and Model 9 and Model 7 includes zero; therefore, the performance of these models may be considered equivalent. However, when we enumerate the number of models that Models 6, 7, and 9, are superior to, we conclude that this number is 4 for Models 6 and 7, and 6 for Model 9. Based on this comparison, we conclude that Model 9 is preferable to Models 6 and 7.
Table 7: Difference in the AUC between models for classifying lesions as malignant or benign. For column j, and row i, each cell in the table reports the difference between the AUC of Model i and Model j. It also contains (in parenthesis) the 95% confidence interval (CI) of this difference.
Discussion
When considering models that utilize all input sources, we observe that the models can classify aggressive versus indolent lesions more accurately than malignant versus benign lesions. The best-performing models for the former (Models 5 and 8) have an AUC of 0.90, whereas the best-performing model (Model 9) for the latter has an AUC of 0.76. The high accuracy in classifying aggressive versus indolent lesions implies that we can provide the physician with a sufficiently accurate assessment so that they may recommend different treatments for patients in these two groups.
When differentiating aggressive lesions from indolent lesions, the knowledge of the size of the lesion plays an important role. This is concluded by comparing the AUC for Models 2 and 3, whose input does not include tumor size information, with all other models, whose input includes tumor size information. In each case, the AUC for Model 2 or 3 is smaller, and this difference is statistically significant.
The judicious reduction in the data dimensions does not influence the performance of the model. The dimension of clinical data is reduced from 20 to 9, and that of image features is reduced from 512 to 20. To assess the effect of the former, we compute the difference between the AUCs of Models 2 and 3 and observe that the 95% CI interval of this difference includes zero, thereby indicating that the difference between these models is not significant. For the latter, we compute the difference in the AUCs between Models 6 and 8, and Models 7 and 9 and once again conclude that the performance of the models with the full and reduced set of features is comparable. The advantage of working with a smaller set of features is two-fold. First, for clinical features, it means that the user collects and curates a smaller dataset. Second, for both clinical and image features, it means that the resulting model has fewer trainable parameters and, therefore, requires less labeled data to train.
When considering all sources of data and a reduced set of features (Models 8 and 9), the RF-based classifier (Model 8) performs better when classifying aggressive versus indolent lesions, while the MLP-based classifier performs better when classifying malignant versus benign lesions. The difference in the AUCs for these models is significant. While it is difficult to tease out why this is the case, this observation implies that a simple partition of the data space into hyper-rectangles accomplished by the RF algorithm is optimal in distinguishing aggressive and indolent tumors. On the other hand, distinguishing malignant and benign tumors requires a more complex partition which is achieved by an MLP.
Models 5, 6, and 8 performed the best when classifying aggressive versus indolent lesions. Among these, Model 5 is particularly attractive because of its simplicity. Its input comprises only the nine important features, and the tumor size is estimated using a single slice of the CT image. The small number of input parameters also implies that the model is easy to train and does not require much data for robust performance.
For classifying malignant versus benign lesions, Model 9 performed the best. This input to this model is also low-dimensional as it includes 9 clinical features, 20 features derived from CECT images, and a single feature that represents the tumor size. However, generating this input requires a significant amount of data collection and processing. It requires all four phases of the CECT images, computing the SimCLR features for these images, dimension reduction via PCA, and explicit calculation of the tumor size. This underlines the fact that distinguishing malignant and benign lesions is a hard problem that requires data from multiple sources.
In the long term, the models developed in this study can be used as an assistive tool in clinical practice to diagnose renal masses as aggressive and/or malignant. The ability to distinguish aggressive tumors from indolent tumors can be used as an improved prognostic tool to enhance risk stratification and ensure that each patient receives tailored treatment based on their oncologic risk and overall health. Future work will focus on developing methods that explicitly model missing or partially available CECT phases, as such scenarios are prevalent in clinical deployment. Furthermore, we plan to perform external validation across multiple institutions to rigorously evaluate generalizability.
Supporting information
S1 AppendixRepresentation learning.(DOCX)
S2 AppendixModel Architecture.(DOCX)
S3 AppendixNested Five-Fold Cross-Validation.(DOCX)
S1 TableHistologic classification of renal masses.(DOCX)
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Krabbe LM, Bagrodia A, Margulis V, Wood CG. Surgical management of renal cell carcinoma. Thieme Medical Publishers. 2014.10.1055/s-0033-1363840 PMC 393065624596437 · doi ↗ · pubmed ↗
- 2Krajewski KM, Pedrosa I. Imaging Advances in the Management of Kidney Cancer. J Clin Oncol. 2018;36(36):JCO 2018791236. doi: 10.1200/JCO.2018.79.1236 30372386 PMC 6299343 · doi ↗ · pubmed ↗
- 3Hollingsworth JM, Miller DC, Daignault S, Hollenbeck BK. Rising incidence of small renal masses: a need to reassess treatment effect. J Natl Cancer Inst. 2006;98(18):1331–4. doi: 10.1093/jnci/djj 362 16985252 · doi ↗ · pubmed ↗
- 4Shapiro DD, Abel EJ. Predicting aggressive behavior in small renal tumors prior to treatment. Ann Transl Med. 2018;6(Suppl 2):S 132. doi: 10.21037/atm.2018.12.46 30740453 PMC 6330616 · doi ↗ · pubmed ↗
- 5Wang ZJ, Westphalen AC, Zagoria RJ. CT and MRI of small renal masses. Br J Radiol. 2018;91(1087):20180131. doi: 10.1259/bjr.20180131 29668296 PMC 6221773 · doi ↗ · pubmed ↗
- 6Bianchi M, Sun M, Jeldres C, Shariat SF, Trinh Q-D, Briganti A, et al. Distribution of metastatic sites in renal cell carcinoma: a population-based analysis. Ann Oncol. 2012;23(4):973–80. doi: 10.1093/annonc/mdr 362 21890909 · doi ↗ · pubmed ↗
- 7Prasad SR, Dalrymple NC, Surabhi VR. Cross-sectional imaging evaluation of renal masses. Radiol Clin North Am. 2008;46(1):95–111, vi–vii. doi: 10.1016/j.rcl.2008.01.008 18328882 · doi ↗ · pubmed ↗
- 8Barrisford GW, Singer EA, Rosner IL, Linehan WM, Bratslavsky G. Familial renal cancer: molecular genetics and surgical management. Int J Surg Oncol. 2011;2011:658767. doi: 10.1155/2011/658767 22312516 PMC 3263689 · doi ↗ · pubmed ↗
