Ultrasound-Based Deep Learning Radiomics Models for Predicting Primary and Secondary Salivary Gland Malignancies: A Multicenter Retrospective Study
Zhen Xia, Xiao-Chen Huang, Xin-Yu Xu, Qing Miao, Ming Wang, Meng-Jie Wu, Hao Zhang, Qi Jiang, Jing Zhuang, Qiang Wei, Wei Zhang

TL;DR
This study uses ultrasound-based deep learning and radiomics to better distinguish between primary and secondary salivary gland tumors, offering a non-invasive diagnostic tool.
Contribution
The novel contribution is a combined radiomics-deep learning model that outperforms traditional methods in differentiating salivary gland malignancies.
Findings
The RadiomicsDL model achieved an AUC of 0.807 on the external test set, outperforming the radiomics-only model (0.636) and the deep learning-only model (0.763).
SHAP analysis identified Wavelet_LHH_glcm_SumEntropy as the most significant radiomic feature in the model.
Abstract
Background: Primary and secondary salivary gland malignancies differ significantly in treatment and prognosis. However, conventional ultrasonography often struggles to differentiate between these malignancies due to overlapping imaging features. We aimed to develop and evaluate noninvasive diagnostic models based on traditional ultrasound features, radiomics, and deep learning—independently or in combination—for distinguishing between primary and secondary salivary gland malignancies. Methods: This retrospective study included a total of 140 patients, comprising 68 with primary and 72 with secondary salivary gland malignancies, all pathologically confirmed, from four medical centers. Ultrasound features of salivary gland tumors were analyzed, and a radiomics model was established. Transfer learning with multiple pre-trained models was used to create deep learning (DL) models from which…
1. Introduction
Salivary gland malignancies can be categorized into primary tumors, which include epithelial malignancies such as mucoepidermoid carcinoma and lymphomas represented by extranodal marginal zone B-cell lymphoma of mucosa-associated lymphoid tissue (MALT Lymphoma), and secondary tumors originating from metastases [1]. The distribution of secondary salivary gland malignancies varies regionally, with metastases from head and neck squamous cell carcinoma accounting for approximately 73%‒100% of cases [2,3]. Treatment and prognoses differ significantly between primary and secondary malignancies. While most patients undergo parotidectomy, neck lymph node dissection, or radiotherapy, the 5-year survival rate for secondary salivary gland squamous cell carcinoma remains substantially lower than that for primary malignancies (32.6% vs. 77.2%) [4]. Therefore, early identification of secondary salivary gland malignancies is crucial.
Although clinical data, imaging examinations, and fine needle aspiration (FNA) [5] can provide initial insights into whether a lesion is benign or malignant, accurate tumor classification still relies heavily on core needle biopsy or postoperative pathology. This reliance can lead to delayed diagnoses or incorrect assessments, increasing surgical risks or resulting in missed treatment opportunities [6].
Ultrasound remains one of the preferred imaging modalities for salivary gland diseases. However, its value in the differential diagnosis of salivary gland tumors is limited by overlapping sonographic features [7]. While elastography and contrast-enhanced ultrasound techniques show promise, they have yet to be widely adopted [8]. Recently, radiomics and deep learning have demonstrated potential in the non-invasive differentiation of benign and malignant salivary gland tumors [9], as well as in distinguishing between different pathological subtypes of benign tumors [10,11,12].
However, no ultrasound-related studies have focused on the differentiation of secondary malignant salivary gland tumors, despite their significant proportion and the clear differences in treatment regimens compared to primary tumors. We aimed to develop a model integrating radiomics and deep learning using a retrospective analysis of ultrasound images from four major medical centers, providing a noninvasive approach to characterize salivary gland malignancies.
2. Materials and Methods
2.1. Patients
This retrospective study analyzed patients diagnosed with salivary gland malignancies across four centers in two regions. The inclusion criteria were as follows: patients with histopathologically confirmed salivary gland malignancies through surgical resection or biopsy, those who underwent preoperative ultrasound examination, and those with complete clinical data. The exclusion criteria were patients with poor-quality ultrasound images impeding accurate diagnosis, salivary gland tumors resulting from the invasion of adjacent malignant tumors, recurrent salivary gland malignancies following total resection, and cases where pathology could not definitively confirm whether malignancies were primary or secondary.
A total of 140 patients were ultimately enrolled, including 111 from Jiangsu Cancer Hospital, who were split in a 7:3 ratio into training and internal validation sets. An additional 29 patients from the First Affiliated Hospital of Nanjing Medical University, Affiliated Hospital of Nantong University, and the Affiliated Jiangning Hospital of Nanjing Medical University were used as the external test set. A flow diagram is shown in Figure 1. This study complied with the Declaration of Helsinki and was approved by the Ethics Committee of Jiangsu Cancer Hospital (No. KY-2024-057; Date: 1 July 2024). The requirement for individual consent was waived.
2.2. Histopathological Outcomes
Pathological diagnoses were based on the 5th Edition of the World Health Organization Classification of Head and Neck Tumors [13]. At Jiangsu Cancer Hospital, all pathological diagnoses were re-evaluated by a pathologist (X.-C.H.) with over 6 years of experience. This process involved reviewing original diagnoses and assessing patients’ clinical data to confirm whether the malignancies were primary or secondary. For the external test set, pathological and clinical data from three regional medical centers were collected by M.-J.W., H.Z., and Q.J. The final diagnoses were confirmed by X.-C.H. using consistent pathological and clinical criteria (Tables S1 and S2 in Supplementary Materials).
2.3. Ultrasound Imaging
Due to the retrospective nature of the study, all ultrasound images were exported in PNG format from the ultrasound report systems and stored on a computer, which ensured high image quality. The ultrasound devices used in the study included MyLab Twice and MyLab 90 (Esaote, Genoa, Italy), HI VISION Preirus (HITACHI, Tokyo, Japan), S2000 (Siemens Healthineers, Erlangen, Germany), LOGIQ E20 (GE Healthcare, Chicago, IL, USA), and ALOKA ARIETTA 850 (FUJI, Tokyo, Japan), all equipped with linear high-frequency probes (with a frequency range of approximately 5 to 12 MHz). The ultrasound images were reviewed by M.W., an ultrasound physician with over 15 years of experience at a hospital comparable in clinical and diagnostic capabilities to the four participating centers, who was blinded to the pathological diagnosis. Because no established diagnostic standards exist for salivary gland malignancies, lesion features were assessed following the American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) [14]. Ultrasound features evaluated included composition, echogenicity, shape, aspect ratio, margin, calcification, and posterior acoustic characteristics.
2.4. Labeling
An ultrasound physician (Z.X.) with 6 years of experience performed the annotation, adhering to standardized procedures and blinded to lesion pathology. Regions of interest (ROIs) were delineated using ITK-SNAP (version 4.0.2; www.itksnap.org, accessed on 20 December 2024), encompassing the entire tumor mass while excluding non-tumorous surrounding tissues. The delineated ROIs were then exported to the Neuroimaging Informatics Technology Initiative (NIfTI) format for subsequent model training.
2.5. Radiomics Features Extraction
The PyRadiomics library was employed to extract radiomics features using a multistep approach. The analyzed image types included the original image, along with various transformed versions, including Wavelet, Square, SquareRoot, Logarithm, Exponential, and Gradient. The extracted features were categorized into three groups: geometry, intensity, and texture. Geometry features describe the two-dimensional shape characteristics of the tumor. Intensity features capture the first-order statistical distribution of voxel intensities within the ROI. Texture features capture patterns and higher-order spatial distributions of intensities, and are extracted using multiple methods, including the Gray level co-occurrence matrix (GLCM), Gray level dependence matrix (GLDM), Gray level run length matrix (GLRLM), Gray level size zone matrix (GLSZM), and Neighboring gray tone difference matrix (NGTDM). Additionally, for three-dimensional features, the third dimension was set to 1 to accommodate specific computational requirements.
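To make the GLCM texture family concrete, the following is a minimal sketch of how a sum-entropy value can be computed from a gray-level co-occurrence matrix. It assumes an already discretized 2D image, a single pixel offset, and a symmetric GLCM; the function name `glcm_sum_entropy` and its parameters are illustrative and not part of PyRadiomics, which additionally restricts the computation to the ROI mask and aggregates over several directions.

```python
import numpy as np

def glcm_sum_entropy(image, levels, offset=(0, 1), eps=np.spacing(1)):
    """Sum entropy of a symmetric, normalized GLCM for one pixel offset.

    image  : 2D array of integer gray levels in [0, levels - 1]
    offset : (row, column) displacement between the two pixels of a pair
    """
    img = np.asarray(image)
    dr, dc = offset
    rows, cols = img.shape
    glcm = np.zeros((levels, levels), dtype=float)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = img[r, c], img[r2, c2]
                glcm[i, j] += 1
                glcm[j, i] += 1  # count the pair in both directions (symmetric GLCM)
    p = glcm / glcm.sum()
    # p_sum[k] is the probability that the two gray levels of a pair add up to k
    p_sum = np.zeros(2 * levels - 1)
    for i in range(levels):
        for j in range(levels):
            p_sum[i + j] += p[i, j]
    return float(-np.sum(p_sum * np.log2(p_sum + eps)))

# Two flat horizontal bands: the level sums 0 and 2 are equally likely
banded = np.array([[0, 0], [1, 1]])
print(glcm_sum_entropy(banded, levels=2))
```

A feature such as wavelet_LHH_glcm_SumEntropy would apply the same idea to a wavelet-filtered version of the image rather than to the original pixels.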
2.6. Radiomics Features Selection
Feature selection proceeded in four steps: (1) Z-score standardization was applied to remove scale effects across all features; (2) independent-samples t-tests or Mann–Whitney U tests were used to calculate p-values for all features between the primary and secondary tumor groups, and features with p < 0.05 were retained; (3) Spearman correlation analysis was used to remove redundant features: features with a correlation coefficient greater than 0.9 were considered highly correlated, and only one feature from each such pair was retained; and (4) the Least Absolute Shrinkage and Selection Operator (LASSO) regression algorithm, combined with five-fold cross-validation, was employed to further eliminate irrelevant features [15]. The final selected features were used for modeling.
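The standardization and redundancy-removal steps can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the function names are hypothetical, ranks are computed without tie handling (reasonable for continuous features), the absolute value of the coefficient is compared with the 0.9 threshold, and the first feature of each correlated pair is the one kept. The univariate tests and LASSO step are not reproduced here.

```python
import numpy as np

def zscore(X):
    """Standardize each feature column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by 0
    return (X - X.mean(axis=0)) / std

def spearman_matrix(X):
    """Spearman correlation as the Pearson correlation of column-wise ranks (no ties assumed)."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def drop_redundant(X, threshold=0.9):
    """Greedy filter: keep a feature only if |rho| <= threshold with every feature kept so far."""
    rho = np.abs(spearman_matrix(np.asarray(X, dtype=float)))
    kept = []
    for j in range(rho.shape[0]):
        if all(rho[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept

# Column 1 is a monotone transform of column 0, so it is dropped as redundant
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([a, a ** 3, [3.0, 1.0, 6.0, 2.0, 5.0, 4.0]])
print(drop_redundant(zscore(X)))
```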
2.7. Deep Learning Training
Several deep learning models with ImageNet pre-trained weights were used for training and validation [16]. The training, internal validation, and external test datasets were loaded based on their respective class labels and were normalized using the ImageNet standard. The models were trained with a stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01, a batch size of 32, and 50 epochs. During training, model performance was evaluated using metrics including accuracy, precision, recall, and F1 scores. Additionally, confusion matrices and receiver operating characteristic (ROC) curves were generated to assess classification performance further.
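As a rough illustration of the normalization step, the sketch below applies the per-channel mean and standard deviation commonly used with ImageNet pre-trained weights. Two assumptions are made that the text does not state: that "the ImageNet standard" refers to these widely used constants, and that a grayscale ultrasound image is replicated to three channels before normalization. The function name is hypothetical.

```python
import numpy as np

# Channel statistics commonly used with ImageNet pre-trained models (assumed here)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_for_imagenet(gray_uint8):
    """Convert a grayscale uint8 image (H, W) to a normalized float array (3, H, W)."""
    x = np.asarray(gray_uint8, dtype=np.float32) / 255.0  # scale to [0, 1]
    rgb = np.stack([x, x, x], axis=0)  # assumption: replicate the gray channel three times
    return (rgb - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]

image = np.full((2, 2), 255, dtype=np.uint8)
print(normalize_for_imagenet(image).shape)
```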
2.8. Radiomics-Deep Learning (RadiomicsDL) and Combined Models
Deep learning features were extracted from the global average pooling (avgpool) layer of the trained model. The classification layer was removed, and the avgpool output, which captured high-level image semantics, was used as the feature vector. The input data underwent forward propagation and the extracted features were organized into a matrix. Principal component analysis (PCA) was applied to reduce dimensionality while retaining key information. Compressed feature vectors were used for modeling, thereby improving the efficiency and performance.
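The PCA compression of pooled deep features can be sketched with a singular value decomposition. This is a minimal illustration under stated assumptions: the feature matrix has one row per image, the function name `pca_reduce` is hypothetical, and the projection is fitted on one set (for example the training set) and then reused for other sets, which the text does not specify.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project row-wise feature vectors onto their first principal components."""
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                   # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                  # principal axes, shape (n_components, n_features)
    scores = Xc @ components.T                      # compressed feature vectors
    explained = (S ** 2) / np.sum(S ** 2)           # variance ratio per component
    return scores, components, mean, explained[:n_components]

# Synthetic stand-in for 2048-dimensional avgpool features of 30 images
rng = np.random.default_rng(0)
deep_features = rng.normal(size=(30, 2048))
scores, components, mean, explained = pca_reduce(deep_features, n_components=10)
print(scores.shape)

# New images would reuse the fitted mean and components
new_scores = (rng.normal(size=(5, 2048)) - mean) @ components.T
```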
Compressed deep learning features were combined with selected radiomics features, and key features were retained through dimensionality reduction, similar to the radiomics features selection. These were used to develop the RadiomicsDL model, which was then combined with the ultrasound features to create the combined model.
2.9. Machine Learning Modeling
Six classical machine learning models (Logistic Regression [LR], Support Vector Machine [SVM], Random Forest, eXtreme Gradient Boosting [XGBoost], Light Gradient Boosting Machine [LightGBM], and Multi-Layer Perceptron [MLP]) were used for modeling. After training, ROC curves were plotted to compare the area under the receiver operating characteristic curve (AUC) across the training, internal validation, and external test datasets. The model with the best performance on the external test dataset was selected as the final model.
2.10. Model Interpretability
Gradient-weighted Class Activation Mapping (Grad-CAM) and SHapley Additive exPlanations (SHAP) techniques were utilized to improve the interpretability of the DL and RadiomicsDL models.
Grad-CAM highlights the key regions that influence classification by generating heat maps from the gradients of the target class with respect to convolutional feature maps. Overlaying these heat maps on ultrasound images reveals areas critical to the model’s performance and identifies clinically relevant features [17].
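The core Grad-CAM computation can be written in a few lines once the feature maps and their gradients are available. The sketch below assumes both have already been obtained from the network for a single image (for example through framework hooks, which are not shown); the function name and array layout are illustrative.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat map from feature maps and class-score gradients.

    activations, gradients : arrays of shape (K, H, W) for K feature maps
    """
    A = np.asarray(activations, dtype=float)
    G = np.asarray(gradients, dtype=float)
    weights = G.mean(axis=(1, 2))             # one importance weight per feature map
    cam = np.tensordot(weights, A, axes=1)    # weighted sum of feature maps, shape (H, W)
    cam = np.maximum(cam, 0.0)                # ReLU keeps regions that support the class
    if cam.max() > 0:
        cam = cam / cam.max()                 # scale to [0, 1] for overlaying on the image
    return cam

# Map 0 supports the class (positive gradient), map 1 opposes it (negative gradient)
A = np.array([[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]])
G = np.array([np.ones((2, 2)), -np.ones((2, 2))])
print(grad_cam(A, G))
```

In practice the resulting map is upsampled to the input resolution before being overlaid on the ultrasound image.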
SHAP quantifies the contribution of individual features to the predictions. Globally, it identifies dominant features, while locally explaining individual predictions by visualizing the direction and magnitude of contributions, further improving interpretability [18].
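For a model with only a handful of inputs, Shapley values can be computed exactly by enumerating feature orderings, which illustrates what SHAP estimates. The sketch below is not the SHAP library: it uses a single baseline vector in place of a background dataset, and `predict`, `x`, and `baseline` are placeholder names for any scoring function and its inputs.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)       # start with every feature at its baseline value
        previous = predict(current)
        for i in order:
            current[i] = x[i]          # switch feature i to its actual value
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    return [value / len(orders) for value in phi]

# Toy three-feature scoring function with one interaction term
def toy_model(v):
    return 2 * v[0] - v[1] + v[0] * v[2]

phi = shapley_values(toy_model, x=[1, 1, 1], baseline=[0, 0, 0])
print(phi)  # contributions sum to toy_model(x) - toy_model(baseline)
```

The sign of each value gives the direction of a feature's contribution and its size gives the magnitude, which is what a waterfall plot displays for a single prediction.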
2.11. Software and Statistical Analysis
Statistical analyses were performed using Python (version 3.7) and R (version 3.6.1). Categorical variables were presented as frequency (n) and percentage (%), while continuous variables were presented as mean ± standard deviation (SD). Group differences were assessed with the chi-square or Fisher's exact test for categorical variables, and independent-samples t-tests or Mann–Whitney U tests for continuous variables. Correlation analyses were performed using Pearson's or Spearman's coefficients, as appropriate. The diagnostic performance of the model was evaluated using the AUC, sensitivity, specificity, and accuracy. Statistical differences between model performances were tested using DeLong's test. Statistical significance was set at p < 0.05.
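The performance metrics named above can be illustrated with a short pure-Python sketch. It assumes binary labels coded 1 (positive class) and 0, computes the AUC as the fraction of positive-negative pairs ranked correctly (ties count as half), and uses an illustrative decision threshold of 0.5; the function names are hypothetical.

```python
def auc_score(labels, scores):
    """AUC as the probability that a positive case scores higher than a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_metrics(labels, scores, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a fixed decision threshold."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(auc_score(labels, scores))          # 3 of 4 positive-negative pairs are ordered correctly
print(confusion_metrics(labels, scores))
```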
3. Results
3.1. Clinical Characteristics
A total of 140 patients with 242 ultrasound images were included in the analysis. Of these, 111 patients from Jiangsu Cancer Hospital served as the development cohort, contributing 187 images. The images were randomly divided into the training set (130 images) and internal validation set (57 images) in a 7:3 ratio. The remaining 55 images, sourced from three other centers, formed the external test cohort. To prevent data leakage, we ensured that images from the same patient were not included in different subsets. A comparison of baseline characteristics between the development and external test cohorts revealed no significant differences (Table 1).
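One way to guarantee that no patient contributes images to both subsets is to split at the patient level and then assign images by patient. The sketch below illustrates that idea with hypothetical image and patient identifiers; it is not the authors' procedure, and because patients contribute different numbers of images, the resulting image ratio is only approximately 7:3.

```python
import random

def split_by_patient(image_to_patient, train_fraction=0.7, seed=0):
    """Split images into training/validation sets so that each patient lands in exactly one set."""
    patients = sorted(set(image_to_patient.values()))
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(patients)
    n_train = round(train_fraction * len(patients))
    train_patients = set(patients[:n_train])
    train = [img for img, pid in image_to_patient.items() if pid in train_patients]
    val = [img for img, pid in image_to_patient.items() if pid not in train_patients]
    return train, val

# Hypothetical cohort: 10 patients with 2 images each
mapping = {f"img_{p}_{k}.png": f"patient_{p}" for p in range(10) for k in range(2)}
train_images, val_images = split_by_patient(mapping)
print(len(train_images), len(val_images))
```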
3.2. Ultrasound Features
The ultrasound (US) features were compared across the three groups based on pathological type (Table 2). The analysis identified statistically significant differences in the aspect ratio and posterior echoes within the training set. Univariate and multivariate logistic regression analyses determined that an aspect ratio <1 and enhanced posterior echoes were independent risk factors (Table 3). These features were subsequently incorporated into a logistic regression model.
3.3. Radiomics Modeling
A total of 1288 radiomics features were extracted with an in-house feature analysis program built on PyRadiomics (http://pyradiomics.readthedocs.io, accessed on 20 December 2024), including 252 first-order features, 14 shape features, 308 GLCM, 196 GLDM, 224 GLRLM, 224 GLSZM, and 70 NGTDM features. After normalization and statistical tests (t-tests or Mann–Whitney U tests) between the primary and secondary tumor groups, 55 statistically significant features were identified. Figure 2 shows all features and their corresponding p-values. Spearman's correlation analysis was then applied, retaining 24 features after eliminating highly correlated pairs (r > 0.9). Using LASSO regression with five-fold cross-validation, the feature set was refined to 10 features with nonzero coefficients for subsequent modeling (Figure S1A–C in Supplementary Materials). Among the six machine learning algorithms evaluated, the SVM performed best, achieving AUC values of 0.841, 0.767, and 0.636 for the training, internal validation, and external test sets, respectively (Table 4).
3.4. Deep Learning Modeling
Eight commonly used ImageNet pre-trained models were systematically evaluated. Among these, ResNet50 demonstrated the best performance on the external test set and was selected for further analysis. Its AUC values for the training, internal validation, and external test sets were 0.848, 0.746, and 0.763, respectively (Table 5).
3.5. Development of the RadiomicsDL Model with Integrated Features
A total of 2048 deep learning features were extracted from the average pooling layer of the trained DL model. These were reduced to 10 key components and combined with radiomics features. Through feature selection, three significant features were identified: DL_0, exponential_gldm_LargeDependenceHighGrayLevelEmphasis, and wavelet_LHH_glcm_SumEntropy (Figure S1D–F in Supplementary Materials). Machine-learning modeling revealed that the MLP model performed best. The resulting RadiomicsDL model achieved AUC values of 0.771 for the internal validation set and 0.807 for the external test set (Table 6).
3.6. Comparison of Model Performance
The diagnostic performance of US, Radiomics, DL, RadiomicsDL, and combined models (integrating US and RadiomicsDL) was compared using receiver operating characteristic (ROC) curves and the DeLong test. The RadiomicsDL model achieved the highest AUC in the external test set (0.807), significantly outperforming the US and Radiomics models. However, no significant differences were found between the DL and combined models (Table 7, Figure 3).
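The DeLong comparison of two correlated AUCs can be sketched as below. This is an illustrative implementation of the standard paired test for two models scored on the same cases, not the authors' code; labels are assumed to be coded 1 and 0, the function names are hypothetical, and a degenerate zero-variance case returns NaN rather than a p-value.

```python
import math
import numpy as np

def _components(pos, neg):
    """Structural components: per-positive and per-negative mean of the pairwise kernel."""
    psi = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
    return psi.mean(axis=1), psi.mean(axis=0)

def delong_test(labels, scores_a, scores_b):
    """Two-sided DeLong test for the difference between two paired AUCs."""
    y = np.asarray(labels)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    v10_a, v01_a = _components(a[y == 1], a[y == 0])
    v10_b, v01_b = _components(b[y == 1], b[y == 0])
    auc_a, auc_b = float(v10_a.mean()), float(v10_b.mean())
    m, n = len(v10_a), len(v01_a)                  # numbers of positive and negative cases
    s10 = np.cov(np.vstack([v10_a, v10_b]))        # 2x2 covariance over positive cases
    s01 = np.cov(np.vstack([v01_a, v01_b]))        # 2x2 covariance over negative cases
    var = ((s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m
           + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n)
    if var <= 0:
        return auc_a, auc_b, float("nan"), float("nan")  # test undefined in this case
    z = (auc_a - auc_b) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))           # two-sided p-value from the normal distribution
    return auc_a, auc_b, float(z), float(p)

labels = [1, 1, 1, 1, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
model_b = [0.6, 0.5, 0.9, 0.2, 0.7, 0.3, 0.8, 0.4]
print(delong_test(labels, model_a, model_b))
```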
3.7. Interpretability Analysis of DL and RadiomicsDL Models
SHAP interpretability was applied to the RadiomicsDL model, and the SHAP Summary Plot (Figure 4A) was generated. The results showed that, among the three features in the model (Wavelet_LHH_glcm_SumEntropy, Exponential_gldm_LargeDependenceHighGrayLevelEmphasis, and DL_0), the radiomic feature Wavelet_LHH_glcm_SumEntropy had the highest impact on the model’s predictions. Additionally, higher values of the two radiomics features were associated with a greater likelihood of primary tumors, whereas DL_0 showed the opposite relationship. SHAP waterfall plots, as shown in Figure 4C–E, are used to explain the prediction process of selected representative lesions. Grad-CAM heat maps were generated from the average pooling layer of the trained ResNet50 model to visualize the ROI in the ultrasound image (Figure 4B).
4. Discussion
In this study, radiomics and deep learning features were extracted from ultrasound images to develop predictive models, including US, Radiomics, DL, RadiomicsDL, and combined models. The models were externally validated across three independent central hospitals, demonstrating the robust diagnostic performance of the RadiomicsDL model. Notably, the RadiomicsDL model achieved an AUC of 0.807 in the external test dataset, effectively distinguishing between primary and secondary salivary gland malignancies. This result highlights the potential of RadiomicsDL for non-invasive clinical applications.
To the best of our knowledge, this is the first study to investigate the pathological classification of salivary gland malignancies using ultrasound imaging. In the training dataset, the aspect ratio and posterior echo were identified as statistically significant features distinguishing primary from secondary salivary gland malignancies. Specifically, an aspect ratio of <1 and posterior echo enhancement were independent indicators of secondary tumors. The US model, developed based on these features, achieved an AUC of 0.726 in the internal training set. However, its performance declined markedly in the external test dataset, with an AUC of only 0.421. This finding underscores the limitations of conventional ultrasound in accurately classifying malignant tumor subtypes. Similar challenges are reflected in its inconsistent sensitivity for distinguishing benign from malignant salivary gland tumors, previously reported to range from 38.9% to 88% [19]. This weakness of the US features also hindered the diagnostic performance of the combined model in the validation and test sets, consistent with the known limitations of conventional ultrasound for salivary gland tumors. The overlapping ultrasound characteristics among tumor types likely represent a key barrier to achieving higher predictive accuracy, highlighting the need for advanced diagnostic tools, such as radiomics or DL approaches, to improve pathological classification precision.
The Radiomics model achieved promising results in the training and internal validation datasets but showed signs of overfitting in the external test dataset. This finding suggests that, while radiomics models can achieve high accuracy within specific datasets, their generalizability and robustness across diverse datasets remain challenges. By contrast, the DL model demonstrated stable performance across all datasets, highlighting its ability to capture data complexity and adapt to heterogeneous data distributions. Previous studies have emphasized that integrating radiomics and DL features can enhance tumor differentiation, staging, and prognosis prediction compared with using either method alone [20,21]. This improvement was attributed to the multi-omics model incorporating additional critical parameters. The integration of radiomics and deep learning in the RadiomicsDL framework improved the AUC and yielded the best performance on both the internal validation set and the external test set. Compared to the standalone DL model, the RadiomicsDL model corrected several misclassifications, reducing the occurrence of false positives and false negatives (Figure 4). This further emphasizes the advantages of combining radiomics and deep learning, particularly in enhancing diagnostic accuracy.
In this study, we used SHAP to analyze the interpretability of the proposed RadiomicsDL model. The model combines two selected radiomics features with one deep learning-derived feature, and the SHAP summary plot identified the radiomic feature Wavelet_LHH_glcm_SumEntropy as the most influential. This feature is the sum entropy of the GLCM computed on the wavelet-filtered (LHH sub-band) image. Our findings indicate that higher values of this feature correspond to an increased likelihood of primary tumor presence. This observation aligns with previous studies, which have demonstrated a correlation between this feature and favorable prognosis as well as reduced tumor invasiveness [22,23]. The integration of these two radiomics features with the deep learning-derived feature DL_0 significantly enhanced the model's discriminative capability. Furthermore, SHAP-based local interpretability analysis enables visualization of how the model arrives at individual predictions.
Although this study provides encouraging preliminary results, it has several limitations. First, the retrospective design prevented the standardization of ultrasound image acquisition, and the analysis was limited to conventional ultrasound images, which may have constrained the model’s generalizability. Second, the relatively low incidence of salivary gland malignancies limited the sample size, despite cases being collected from multiple central hospitals. Variations in the regional distribution of pathological subtypes, potentially reflecting differences in population genetics or healthcare practices, may have introduced instability and reduced the reliability of the results. Moreover, since this study focused solely on binary classification of salivary gland malignancies, its applicability is limited. The failure to identify lymphomas separately is another limitation of this research.
To address these limitations, future studies should explore the integration of multimodal imaging data, such as adding Color Doppler Flow Imaging (CDFI) and elastography, to complement ultrasound findings and enhance diagnostic accuracy, particularly in further subtyping of tumors. Additionally, developing more generalized and versatile multilayer diagnostic models that can provide initial benign/malignant classification as well as further subtype classification for salivary gland tumors would be beneficial. Furthermore, collaborative efforts across multiple centers, along with the accumulation of large-scale datasets integrating clinical and genomic information, offer hope for building more comprehensive and robust diagnostic models.
5. Conclusions
In this study, we successfully extracted radiomics and deep learning features from salivary gland tumor ultrasound images. Through feature selection and machine learning, we developed a RadiomicsDL model capable of effectively distinguishing between primary and secondary salivary gland malignancies, thereby assisting clinicians in making accurate diagnoses.
References
- 1. Alsanie I., Rajab S., Cottom H., Adegun O., Agarwal R., Jay A., Graham L., James J., Barrett A.W., Van Heerden W. Distribution and Frequency of Salivary Gland Tumours: An International Multicenter Study. Head Neck Pathol. 2022;16:1043–1054. doi:10.1007/s12105-022-01459-0. PMID: 35622296; PMCID: PMC9729635.
- 2. He A., Lei H., Li H., Li X., Yang Y., Wang Y., Ong H., Zhao X., Ruan M., Han N. Metastatic Parotid Gland Malignancy: A Preliminary Study in an Eastern Chinese Population. J. Stomatol. Oral Maxillofac. Surg. 2023;124:101309. doi:10.1016/j.jormas.2022.10.008. PMID: 36252929.
- 3. Mckenzie J., Lockyer J., Singh T., Nguyen E. Salivary Gland Tumours: An Epidemiological Review of Non-Neoplastic and Neoplastic Pathology. Br. J. Oral Maxillofac. Surg. 2023;61:12–18. doi:10.1016/j.bjoms.2022.11.281. PMID: 36623970.
- 4. Meyer M.F., Wolber P., Arolt C., Wessel M., Quaas A., Lang S., Klussmann J.P., Semrau R., Beutner D. Survival after Parotid Gland Metastases of Cutaneous Squamous Cell Carcinoma of the Head and Neck. Oral Maxillofac. Surg. 2021;25:383–388. doi:10.1007/s10006-020-00934-8. PMID: 33400041; PMCID: PMC8352831.
- 5. Pusztaszeri M., Rossi E.D., Faquin W.C. Update on Salivary Gland Fine-Needle Aspiration and the Milan System for Reporting Salivary Gland Cytopathology. Arch. Pathol. Lab. Med. 2024;148:1092–1104. doi:10.5858/arpa.2022-0529-RA. PMID: 37226841.
- 6. Horáková M., Porre S., Tommola S., Baněčková M., Skálová A., Kholová I. FNA Diagnostics of Secondary Malignancies in the Salivary Gland: Bi-Institutional Experience of 36 Cases. Diagn. Cytopathol. 2021;49:241–251. doi:10.1002/dc.24629. PMID: 33017519.
- 7. van Herpen C., Poorten V.V., Skalova A., Terhaard C., Maroldi R., van Engen A., Baujat B., Locati L.D., Jensen A.D., Smeele L. Salivary Gland Cancer: ESMO–European Reference Network on Rare Adult Solid Cancers (EURACAN) Clinical Practice Guideline for Diagnosis, Treatment and Follow-Up. ESMO Open 2022;7:100602. doi:10.1016/j.esmoop.2022.100602. PMID: 36567082; PMCID: PMC9808465.
- 8. Shi L., Wu D., Yang X., Yan C., Huang P. Contrast-Enhanced Ultrasound and Strain Elastography for Differentiating Benign and Malignant Parotid Tumors. Ultraschall Med. 2023;44:419–427. doi:10.1055/a-1866-4633. PMID: 36731495; PMCID: PMC10629480.
