A novel interpretable machine learning framework integrating clinicopathological and radiomic features for early recurrence prediction in mass-forming intrahepatic cholangiocarcinoma
Xiao-li Deng, Chongze Yang, Lan-hui Qin, Xue-feng Lin, Fen Zhong, Jin-yuan Liao

TL;DR
This study creates a transparent machine learning model that combines clinical and imaging data to predict early recurrence in a type of liver cancer called intrahepatic cholangiocarcinoma.
Contribution
A novel interpretable machine learning framework called IFSCI is introduced, integrating clinicopathological and radiomic features with SHAP-based explanations.
Findings
The MLP-based combined model achieved AUCs of 0.933 in training, 0.891 in internal validation, and 0.856 in external validation.
SHAP values provided global and case-level interpretability, clarifying the decision logic for individual predictions.
The model showed good calibration and clinical benefit across 10–30% threshold probabilities according to decision curve analysis.
Abstract
Early recurrence after curative resection remains a major determinant of poor prognosis in intrahepatic cholangiocarcinoma (ICC). Existing multimodal prediction models often lack interpretability due to feature interference during fusion. This study aimed to develop and externally validate an interpretable multimodal machine-learning model using a novel Independent Feature Selection and Consistent Integration (IFSCI) framework, combined with SHAP-based explanation to enhance transparency. A total of 264 patients with mass-forming ICC who underwent radical resection were retrospectively enrolled from two centers. Clinical and CT-based radiomics features were independently selected within each modality using statistical testing and LASSO under the IFSCI design, ensuring modality-specific interpretability before consistent integration. Support vector machine (SVM), random forest (RF), and…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5- —https://doi.org/10.13039/501100001809National Natural Science Foundation of China
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCholangiocarcinoma and Gallbladder Cancer Studies · Gallbladder and Bile Duct Disorders · Hepatocellular Carcinoma Treatment and Prognosis
Introduction
Intrahepatic cholangiocarcinoma (ICC) is a primary malignant tumor originating from the epithelial cells of bile ducts located above the second-order intrahepatic bile ducts. It represents the second most common hepatic malignancy, accounting for approximately 10% to 15% of primary liver cancers [1, 2]. Compared to hepatocellular carcinoma, ICC is relatively uncommon; however, epidemiological studies suggest that the global incidence and mortality rates of ICC are steadily increasing, despite variations across different regions and countries [3]. The clinical presentation of ICC is nonspecific, with most patients remaining asymptomatic during the early stages. As the disease progresses, patients may experience symptoms such as weight loss, abdominal discomfort, jaundice, hepatomegaly, and palpable abdominal masses. Current evidence indicates that radical surgical resection offers the best chance for long-term survival in ICC patients [4, 5]. Unfortunately, ICC is highly aggressive and associated with poor prognosis. A systematic review reported that the 5-year overall survival rate following radical resection rarely exceeds 30%–35%, with a median overall survival of approximately 28 months [6]. Therefore, timely identification of patients at high risk for early recurrence is critical to improving patient prognosis.
Recently, considerable attention has been given to identifying risk factors for early recurrence of ICC; however, there is currently no consensus on the definition of early recurrence. Previous studies commonly defined early recurrence as recurrence within two years post-surgery, a criterion similar to that used for hepatocellular carcinoma (HCC) [7, 8]. However, ICC generally carries a worse prognosis compared to HCC, with most recurrences occurring within the first two years after surgical resection. Therefore, defining early recurrence with a two-year threshold may not be appropriate for ICC patients [2]. A growing body of research indicates that recurrence within one year is strongly associated with poorer outcomes, suggesting that defining early recurrence as recurrence within one year might be more suitable for ICC patients [9–11]. Hence, accurately predicting recurrence within the first year after surgery holds significant importance for improving patient prognosis and guiding clinical decision-making.
With the continuous advancement of computer technologies, radiomics and machine learning have emerged as novel diagnostic tools in oncology, making prognostic prediction for ICC increasingly reliable [12]. Radiomics, as a research hotspot in imaging science, utilizes computer-assisted techniques to quantitatively extract subtle imaging features that are imperceptible to the human eye [13]. This approach allows for a shift from subjective interpretation to objective analysis, showing remarkable potential in predicting ICC prognosis based on imaging data [14]. Moreover, machine learning enables researchers to identify prognostically relevant features from complex clinical and pathological datasets, thereby facilitating outcome prediction for ICC patients [15]. Given that a single data modality may not fully capture the heterogeneity of ICC, integrating multimodal data may offer a more comprehensive solution. Machine learning models developed through the integration of radiomics, clinical, and pathological features can significantly enhance early recurrence prediction, yielding synergistic clinical benefits that surpass the value of any single modality alone [16].
However, most current studies suffer from limitations in their multimodal model construction strategies. Typically, features from various modalities—such as clinical, pathological, and radiomics—are combined first, followed by a unified feature selection process and model development [17]. This approach presents two major drawbacks: (1) The selected features in the final combined model may differ greatly—or even entirely—from those used in independently constructed unimodal models, thereby reducing both the comparability between unimodal and multimodal models and the interpretability of the combined model. (2) Imbalances in feature quantity across modalities—for instance, a large number of radiomics features versus relatively few clinical or pathological features—can result in the exclusion of clinically meaningful modality-specific features due to inter-feature interference during joint selection [18, 19].
Overall, such a strategy weakens both the interpretability of multimodal models and their comparability with unimodal counterparts. An alternative approach—independently selecting features within each modality and then consistently integrating them into the final combined model—may enhance both model interpretability and cross-model comparability.
Furthermore, it is important to note that machine learning faces an inherent limitation in clinical applications: it is often referred to as a “black-box algorithm” due to its lack of interpretability [20, 21]. This opaque computational process reduces the level of trust and acceptance among clinicians, thereby restricting the widespread adoption of machine learning in clinical practice. However, for complex clinical tasks such as tumor prognosis, accurate and transparent model interpretation is both crucial and highly challenging. Previous studies have attempted to translate machine learning results into clinical practice using tools such as nomograms, which help visualize model predictions and facilitate clinical application to some extent. Nevertheless, these tools do not fundamentally assist clinicians in understanding the decision-making process of machine learning models [15]. In recent years, the emergence of Shapley additive explanations (SHAP), an interpretation method grounded in game theory, has offered a novel solution to this challenge [22]. By calculating SHAP values, this method assigns to each feature its marginal impact on the prediction, clarifying its importance within the model and elucidating the logic behind its decision.
To address feature-interference in existing multimodal models, we propose the Independent Feature Selection and Consistent Integration (IFSCI) strategy. IFSCI independently screens and retains each modality’s key features, then consistently integrates them into a unified model to predict early recurrence risk in ICC patients, thereby enhancing model comparability and interpretability. Furthermore, we apply SHAP‐based explainability analysis at both global and local levels to dissect the model’s decision‐making process, identify the most influential features, and ultimately improve the model’s transparency, interpretability, and clinical credibility.
Materials and methods
We followed the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) guidelines to report the development and validation of the multivariable prediction model [23]. This retrospective study was approved by the ethics committee of hospital, with informed consent waived. The overall study flowchart is presented in Fig. 1.
Fig. 1. Overall study flowchart. ROI: Region of Interest. LASSO: Least Absolute Shrinkage and Selection Operator. SVM: Support Vector Machine. RF: Random Forest. MLP: Multilayer Perceptron. ROC: Receiver Operating Characteristic. DCA: Decision Curve Analysis
Patients and clinical characteristics
This multicenter cohort comprised ICC cases diagnosed between January 2014 and December 2023 at two Hospital. All patients underwent radical surgical resection, and histopathological analysis confirmed the diagnosis of ICC.
The inclusion criteria were as follows: (1) intrahepatic mass-forming cholangiocarcinoma with a solitary lesion larger than 1 cm in diameter; (2) radical surgical resection performed; (3) postoperative pathological confirmation of ICC with complete pathological data; and (4) All patients underwent both non‑contrast and contrast‑enhanced liver CT scans within one month prior to surgery. The exclusion criteria were: (1) incomplete clinical data; (2) presence of distant metastasis; (3) absence of qualified preoperative CT images; (4) receipt of arterial embolization or other interventional therapies before CT examination; and (5) lack of follow-up records.
Clinical data collected included age, sex, history of alcohol consumption, and presence of liver cirrhosis. Preoperative tumor markers such as serum AFP, CA19-9, CA125, and CEA were also recorded. Pathological data were retrospectively collected, including tumor differentiation grade, lymph node metastasis status, microvascular invasion, surgical margin status, and Ki-67 expression levels from immunohistochemical analysis.
Follow-up surveillance
All patients included in this study underwent radical surgical resection and were regularly followed up at the two participating medical centers. The study endpoint was the occurrence of early recurrence after radical resection for ICC, which was defined as recurrence within 12 months postoperatively. All enrolled patients followed a standardized postoperative monitoring protocol, with evaluations scheduled at 1, 3, 6, and 12 months post-surgery. Follow-up assessments included serological tumor markers such as CA19-9, AFP, and CEA, as well as abdominal ultrasound, contrast-enhanced CT, and contrast-enhanced MRI. The data were reviewed in December 2024.
CT image acquisition
CT images were acquired using multidetector spiral CT scanners. At Institution 1, the scanners included the Siemens SOMATOM Force, Siemens SOMATOM Definition Flash, GE LightSpeed VCT, and GE Revolution 256-slice spiral CT, while Institution 2 utilized the Siemens Sensation 64 CT and GE Discovery CT750. All patients underwent both non-contrast and contrast-enhanced liver CT scans within one month before surgery. The protocol involved an initial conventional plain scan of the liver, followed by contrast-enhanced scans.
Image segmentation and radiomics feature extraction
The CT images from the arterial and portal venous phases were imported into ITK-SNAP (version 3.8). Radiologist 1 (with 12 years of experience) manually delineated the tumor volumes of interest (VOIs) on the arterial and portal venous phase images, respectively. All VOIs were confirmed by a second radiologist (with 15 years of experience), and any disputes were resolved through discussion involving a third radiologist (with 24 years of experience). In addition, after more than 3 months, radiologist 1 randomly selected 20 patients again to delineated the ROI. When the extracted features intra-class correlation coefficients (ICC) were greater than 0.75, the radiomics feature extraction showed good consistency.
All CT images were standardized prior to radiomics extraction. The arterial- and portal-phase images were resampled to an isotropic voxel spacing of 1 × 1 × 1 mm³, the window width and level were harmonized across scanners. These preprocessing steps were applied uniformly to minimize scanner-dependent variability.
Quantitative radiomics features were extracted using the Python package pyradiomics. A total of 1,051 features were computed from the delineated VOIs based on CT images from both the arterial and portal venous phases. The selected variables comprised first‑order statistics and shape descriptors, along with texture metrics extracted from gray‑level co‑occurrence (GLCM), run‑length (GLRLM), size‑zone (GLSZM), dependence (GLDM), and neighboring gray‑tone difference (NGTDM) matrices, as well as higher‑order features.
Feature selection and model construction
Patient data from Center One (n = 196) were stratified using an 8:2 ratio, resulting in a training set of 157 cases and an internal validation set of 39 cases. Data from Center Two (n = 68) were used as an external test set. The radiologist performed two intragroup segmentations, and the resulting ICC values were greater than 0.75, indicating that the radiomics feature extraction had satisfactory consistency.
All feature-selection procedures and model training were performed exclusively within the training set to avoid information leakage. The internal validation set and the external test cohort were not involved in any step of feature selection or model training; instead, they served solely as independent datasets for model evaluation. This design ensured a fully leakage-free workflow and allowed unbiased assessment of model generalizability.
Clinical feature
For clinical features, categorical variables were compared between groups using the chi-square test, while continuous variables were assessed using either the Mann–Whitney U test or the independent-samples t-test. Only variables showing statistically significant differences between the two groups were subsequently incorporated into the model-building process.
Radiomics feature
Radiomics feature selection was performed using a structured, stage-wise reduction workflow to address the high dimensionality and multicollinearity inherent in the extracted features. First, all radiomics features were standardized using Z-score normalization. A Mann–Whitney U test was then employed as an initial univariate filtering step to remove features without detectable discriminatory ability between groups, thereby improving computational efficiency for subsequent multivariate procedures. Second, the Least Absolute Shrinkage and Selection Operator (LASSO) regression was applied to further refine the feature set by resolving multicollinearity. Through penalized coefficient shrinkage based on an optimized tuning parameter (λ), LASSO reduced the feature space to a subset with stable and non-redundant predictive contributions. In this workflow, LASSO served as an intermediate selection step to derive an interpretable, low-correlation feature group suitable for downstream model development. Third, SelectKBest was used to standardize the dimensionality of the final feature set. The number of retained features was set to K = 10 for each imaging phase, following the commonly recommended 10:1 sample-to-feature ratio in radiomics research. This design ensured an appropriate balance between feature complexity and available sample size, thereby reducing overfitting risk and enhancing model robustness.
Model construction
In this study, clinical pathological features and radiomics features were separately input into Support Vector Machine (SVM), Random Forest (RF), and Multilayer Perceptron (MLP) algorithms to construct clinical and radiomics models for predicting early recurrence of ICC. Subsequently, to enhance comparability and interpretability between unimodal and multimodal approaches, clinical and radiomics features were merged and fed into a single framework to assess the risk of early ICC recurrence.
Evaluation of model performance
The performance of the predictive models was evaluated using the area under the receiver operating characteristic curve (AUC), along with sensitivity, specificity, accuracy, Youden index, F1-score, and Brier score. Calibration curves and decision curve analysis (DCA) were employed to assess the stability and clinical benefit of the models, respectively. Threshold probabilities between 10% and 30% were selected because this range reflects clinically meaningful decision points in postoperative ICC management, where recurrence probabilities below 10% rarely alter follow-up strategies, whereas probabilities above 20–30% often trigger intensified surveillance or consideration of adjuvant treatment. The DeLong test was used to evaluate performance distinctions between models.
SHAP value: model interpretation based on game theory
To better understand and interpret the fundamental decision rules of the machine learning model, this study incorporated SHAP values to elucidate the decision-making process. SHAP, a post hoc explanation method grounded in game theory, computes the marginal contribution of each feature to the model output. Consequently, it provides clear and intuitive linear explanations for the impact of individual features, thereby enabling both global and local interpretability of the model [22].
Statistical analysis
Radiomics feature extraction was performed using the open-source pyradiomics package in Python (version 3.7). All data analyses were conducted using R software (version 4.3.3) and Python (version 3.7). A P value of < 0.05 was considered statistically significant.
Results
Patient
Of the 264 ICC cases enrolled, 196 came from Institution 1 (126 with recurrence, 70 without) and 68 from Institution 2 (46 with recurrence, 22 without) (Table 1, and Supplementary Materials 1).
Table 1. Clinical characteristics of ICC patients in the training setClinical characteristicsRecurrence (n = 126)No recurrence (n = 70)Statistics P Age (years)55.79 ± 10.7355.30 ± 10.53-0.8420.401Ki6737.759 0 ≤ 2027 (21.43%)46 (65.71%) >2099 (78.57%)24 (34.29%)Drinking history1.0840.298 Yes49 (61.11%)22 (31.43%) No77 (38.89%)48 (68.57%)Diabetes0.030.863 Yes21 (16.67%)11 (15.71%) No105 (83.33%)59 (84.29%)AST3.5670.059 ≤ 4069 (54.76%)48 (68.57%) >4057 (45.24%)22 (31.43%)ALT0.2460.62 ≤ 4071 (56.35%)42 (60.00%) >4055 (43.65%)28 (40.00%)GGT4.927 0.026 ≤ 6046 (36.51%)37 (52.86%) >6080 (63.49%)33 (47.14%)TBIL(umol/L)0.5280.467 0101 (80.16%)53 (75.71%) 125 (19.84%)17 (24.29%)CEA(ng/mL)12.492 0 < 571 (56.35%)57 (81.43%) ≥ 555 (43.65%)13 (18.57%)CA125(U/mL)3.0830.079 < 3582 (65.08%)54 (77.14%) ≥ 3544 (34.92%)16 (22.86%)CA199(U/mL)0.2610.609 < 3460 (47.62%)36 (51.43%) ≥ 3466 (52.38%)34 (48.57%)CA153(U/mL)6.827 0.009 < 2889 (70.63%)61 (87.14%) ≥ 2837 (29.37%)9 (12.86%)AFP(ng/mL)8.764 0.003 < 2093 (73.81%)64 (91.43%) ≥ 2033 (26.19%)6 (8.57%)Clonorchis sinensis infection2.2260.136 Yes69 (54.76%)24 (34.29%) No57 (45.24%)46 (65.71%)Peritumoral bile duct stones1.0430.307 Yes9 (7.14%)8 (11.43%) No117 (92.86%)62 (88.57%)Peritumoral bile duct dilatation0.0370.848 Yes63 (50.00%)34 (48.57%) No63 (50.00%)36 (51.43%)Liver capsular retraction0.0020.966 Yes68 (53.97%)38 (54.29%) No58 (46.03%)32 (45.71%)Vascular tumor thrombus0.0640.8 Yes40 (31.75%)21 (30.00%) No86 (68.25%)49 (70.00%)Enlarged lymph nodes1.2270.268 Yes58 (46.03%)38 (54.29%) No68 (53.97%)32 (45.71%)Tumor location2.6370.268 Perihilar type25 (19.84%)21 (30.00%) Peripheral type82 (65.08%)39 (55.71%) both perihilar and peripheral regions19 (15.08%)10 (14.29%)Tumor morphology0.530.467 Regular67 (53.17%)41 (58.57%) Irregular59 (46.83%)29 (41.43%)Maximum tumor diameter (cm)6.71 ± 3.116.00 ± 2.80-0.0620.951CT plain scan4.132 0.042 Homogeneous density22 (17.46%)21 (30.00%) Heterogeneous density104 (82.54%)49 (70.00%)Arterial phase characteristics0.5040.777 Isodensity or hypodensity44 (34.92%)21 (30.00%) Heterogeneous hyperdensity14 (11.11%)8 (11.43%) Rim-like hyperdensity68 (53.97%)41 (58.57%)Enhancement pattern6.080.108 Type 138 (30.16%)33 (47.14%) Type 248 (38.10%)21 (30.00%) Type 331 (24.60%)11 (15.71%) Type 49 (7.14%)5 (7.15%)Histological differentiation0.7570.384 Low53 (42.06%)25 (35.71%) Moderate - high73 (57.94%)45 (64.29%)CK190.0080.931 Negative2 (1.59%)1 (1.43%) Positive124 (98.41%)69 (98.57%)CK70.8570.355 Negative14 (11.11%)11 (15.71%) Positive112 (88.89%)59 (84.29%)P531.3340.248 Negative35 (27.78%)25 (35.71%) Positive91 (72.22%)45 (64.29%)Liver cirrhosis0.1570.692 Yes13 (10.32%)6 (8.57%) No113 (89.68%)64 (91.43%)MVI2.110.348 M075 (59.52%)41 (58.57%) M145 (35.71%)22 (31.43%) M26 (4.77%)7 (10.00%)Type 1: Typical rim enhancement with gradual inward filling; Type 2: Progressive hypoenhancement; Type 3: Arterial phase rim enhancement with portal phase washout; Type 4: Heterogeneous persistent hyperenhancement
Feature selection
For clinical pathological features, among the 196 patients from Institution 1, six clinical features—namely, Ki67, GGT, CEA, CA153, AFP, and CT plain scan—showed statistically significant differences with respect to the early recurrence of ICC. Thus, the clinical and combined models were each built upon these six features (Table 1).
Regarding radiomics features, an initial screening using the Mann–Whitney U test identified 88 arterial phase CT radiomics features and 123 portal venous phase CT radiomics features (p < 0.05). Subsequently, the LASSO was employed to further refine these features, ultimately selecting 23 arterial phase and 26 portal venous phase CT radiomics features. The LASSO coefficient path illustrating this process is shown in Fig. 2.
Fig. 2. Feature selection processes using LASSO and K-best methods. Panels (A–B) illustrate the LASSO process for feature selection in the arterial phase (A) and portal venous phase (B), respectively. Panels (C–D) display the K-best feature selection process for the arterial phase (C) and portal venous phase (D), respectively
Furthermore, given that the total number of cases in the training and test sets was 196, and to avoid overfitting, the SelectKBest was applied to further reduce the number of features, standardizing the total number of radiomics features to 20. Ultimately, the top 10 features from each of the arterial and portal venous phase feature sets were selected for constructing the radiomics and combined models (Fig. 2).
Performance comparison of different models
In this study, based on the 20 selected radiomics features and 6 clinical features, radiomics, clinical, and combined radiomics–clinical models were constructed using SVM, RF, and MLP algorithms (Table 2 and Supplementary Materials 2).
Table 2. Comprehensive feature list included in the modelTypes of featuresFeatures listRadiomic features in arterial phasewavelet-HLH glszm SizeZoneNonUniformitywavelet-HHH glcm Autocorrelationwavelet-HLL firstorder InterquartileRangewavelet-HLH firstorder Meanoriginal firstorder 10Percentilewavelet-LLL firstorder Minimumwavelet-LLL glrlm GrayLevelNonUniformityNormalizedwavelet-HLH glszm LowGrayLevelZoneEmphasiswavelet-HLH glszm SmallAreaHighGrayLevelEmphasislog-sigma-3-mm-3D glszm GrayLevelNonUniformityNormalizedRadiomic features of portal venous phasewavelet-HLH firstorder Medianwavelet-LLL glcm MaximumProbabilityoriginal shape Elongationlog-sigma-5-mm-3D glszm SmallAreaLowGrayLevelEmphasiswavelet-LLH glszm GrayLevelNonUniformityNormalizedwavelet-HLL glcm ClusterTendencyoriginal firstorder Minimumwavelet-HLL firstorder Medianwavelet-LLL glcm ldnlog-sigma-3-mm-3D glcm ClusterShadeClinical characteristicsKi67GGTCEACA153AFPCT plain scan
Regarding the clinical model, integrating six clinicopathological indicators, the predictive algorithm demonstrated strong accuracy in forecasting early ICC recurrence. Notably, the MLP algorithm achieved superior performance in the training set, with an AUC of 0.911 (95% CI: 0.855–0.950), compared to the SVM algorithm (AUC = 0.826, 95% CI: 0.758–0.882) and the RF algorithm (AUC = 0.815, 95% CI: 0.745–0.872). This performance trend was consistently observed in both the internal and external validation sets (Table 3).
Table 3. Model diagnostic performanceAlgorithmDatasetModelAUC (95% CI)SensitivitySpecificityComparison with combined modelSVMTraining setClinical model0.826 (0.758–0.882)84.1683.93Z = 1.693, P = 0.0904Radiomics model0.834 (0.767–0.889)87.1382.14Z = 1.312, P = 0.1895Combined model0.896 (0.837–0.939)88.1285.71Internal validation setClinical model0.771 (0.609–0.890)76.0078.57Z = 0.676, P = 0.4988Radiomics model0.811 (0.654–0.919)84.0078.57Z = 0.332, P = 0.7400Combined model0.851 (0.701–0.945)88.0085.71External validation setClinical model0.712 (0.590–0.816)76.0972.73Z = 1.112, P = 0.2662Radiomics model0.796 (0.681–0.884)80.4372.73Z = 0.194, P = 0.8464Combined model0.813 (0.700–0.897)84.7881.82RFTraining setClinical model0.815 (0.745–0.872)86.1473.21Z = 1.254, P = 0.2100Radiomics model0.853 (0.788–0.905)90.1080.36Z = 0.501, P = 0.6163Combined model0.867 (0.803–0.916)92.0880.36Internal validation setClinical model0.774 (0.612–0.892)84.0078.57Z = 0.413, P = 0.6796Radiomics model0.791 (0.631–0.905)88.0085.71Z = 1.342, P = 0.1795Combined model0.826 (0.671–0.928)88.0078.57External validation setClinical model0.716 (0.594–0.819)78.2672.73Z = 0.734, P = 0.4632Radiomics model0.772 (0.654–0.865)80.4372.73Z = 0.299, P = 0.7653Combined model0.801 (0.687–0.888)84.7881.82MLPTraining setClinical model0.911 (0.855–0.950)91.0991.07Z = 0.620, P = 0.5354Radiomics model0.925 (0.872–0.961)92.0894.64Z = 0.209, P = 0.8341Combined model0.933 (0.881–0.966)94.0692.86Internal validation setClinical model0.811 (0.654–0.919)84.0078.57Z = 0.728, P = 0.4668Radiomics model0.823 (0.667–0.926)84.0085.71Z = 0.614, P = 0.5395Combined model0.891 (0.750–0.968)88.0085.71External validation setClinical model0.753 (0.633–0.850)78.2672.73Z = 1.503, P = 0.1328Radiomics model0.811 (0.698–0.896)82.6181.82Z = 0.617, P = 0.5374Combined model0.856 (0.749–0.929)84.7881.82
The radiomics model was constructed by combining 10 arterial phase radiomics features with 10 portal venous phase radiomics features. The radiomics model surpassed the clinical model in performance across all three machine learning algorithms (SVM, RF, and MLP). Notably, the MLP algorithm achieved the best predictive performance, with AUCs of 0.925 (95% CI: 0.872–0.961) in the training set, 0.823 (95% CI: 0.667–0.926) in the internal validation set, and 0.811 (95% CI: 0.698–0.896) in the external validation set (Table 3).
In this study, a combined model was further developed by integrating 20 imaging features and 6 clinical features using three different machine learning algorithms (SVM, RF, and MLP) to predict early recurrence of ICC. Among the combined models constructed with these algorithms, diagnostic performance across training, internal validation, and external validation cohorts surpassed that of the unimodal clinical and radiomics models. Notably, the combined model developed using the MLP algorithm achieved the best predictive performance, with AUCs of 0.933 (95% CI, 0.881–0.966) in the training set, 0.891 (95% CI, 0.750–0.968) in the internal validation set, and 0.856 (95% CI, 0.749–0.929) in the external validation set (Table 3).
Evaluation and application of calibration curves, DCA
Calibration curves further demonstrated a high level of concordance between the predicted values and the observed outcomes, underscoring the reliability of the model’s predictions. Moreover, DCA revealed that, compared to solely empirical clinical judgment, the combined model conferred a significant net benefit for patients across a broader range of risk thresholds (Fig. 3).
Fig. 3. Composite figure of MLP diagnostic performance. Panels (A–C) display the ROC curves (AUC), Panels (D–F) show the calibration curves, and Panels (G–I) illustrate the decision curve analysis (DCA)
SHAP values: global and local interpretability
To enhance the interpretability of the model, we employed SHAP values to provide both global and local explanations for the MLP-based combined model that demonstrated the highest predictive performance. SHAP analysis was performed exclusively on the final 26 features. The SHAP bee-swarm plot offers a global interpretation by quantifying each feature’s contribution to the model output. In the plot, the vertical axis lists feature names in descending order of their importance to the model prediction. Each dot represents the specific value of the feature for an individual sample; the color intensity and horizontal position of the dot indicate the magnitude of the feature’s value and whether its impact on the prediction is positive or negative. Furthermore, the density of dots reflects the frequency of feature values within the dataset (Fig. 4).
Fig. 4SHAP bee swarm plot – global interpretation of the MLP-based combined model
In addition, SHAP values facilitate detailed local interpretation for individual cases. In this context, the length of each feature bar denotes the magnitude of its contribution to the prediction, while the direction of the bar indicates whether the feature increases (positive contribution) or decreases (negative contribution) the likelihood of early recurrence. The sum of all SHAP values for the features constitutes the overall SHAP total for that case; a positive total suggests that the model is inclined to predict a high risk of early recurrence, whereas a negative total indicates a low risk (Fig. 5).
Fig. 5. Local explanation provided by SHAP values for a specific case. Panels (A–C) display the plain scan, arterial phase, and portal venous phase images of an ICC patient with early recurrence. Panel (D) shows the SHAP waterfall plot for this patient, which indicates a positive cumulative SHAP value, suggesting that the model predicted early recurrence
Integrating SHAP values with the machine learning model facilitates clinicians’ understanding of the decision-making process from both global and local perspectives. This integration enhances the model’s transparency and clinical interpretability, thereby increasing the credibility and practical utility of its predictive outcomes. Ultimately, it contributes to more precise identification of ICC patients at high risk of early recurrence and provides robust evidence to support precision medicine in clinical practice.
Discussion
In this study, a machine learning model was established using an IFSCI approach. Separate clinical models, radiomics models, and a multimodal combined model were developed based on clinical-pathological features and radiomics features. The results demonstrated that both the clinical and radiomics models exhibited robust predictive performance for early recurrence in ICC, While fusing clinical and radiomic feature sets further improved predictive efficacy. Among the three machine learning algorithms evaluated, the MLP model showed superior predictive performance, a finding that was corroborated by the calibration curves and DCA. Overall, this method provided high inter-model comparability and interpretability, and the inclusion of SHAP values further facilitated both global and local interpretation of the machine learning models.
Recognizing the prognostic relevance of clinical features in ICC, we selected six clinical parameters for analysis—Ki67, GGT, CEA, CA153, AFP, and findings on plain CT scans—demonstrated significant correlations with early recurrence after ICC surgery, which aligns with the bulk of existing literature. Specifically, Ki-67 is a nuclear protein closely related to cellular proliferation, and its expression level is positively correlated with tumor aggressiveness. Several studies have identified Ki-67 as an independent predictor of poor prognosis in ICC [24, 25]. Moreover, Ki-67 is increasingly regarded as a potential therapeutic target in malignancies; for example, the CDK inhibitor Dinaciclib has been shown to suppress ICC growth by targeting and downregulating Ki-67 expression [26]. In addition, silencing the long non-coding RNA (lncRNA) CASC15 effectively reduces Ki-67 levels, thereby hindering ICC progression [27]. Gamma-glutamyl transpeptidase (GGT), a liver function marker, is closely linked to the occurrence and progression of hepatic tumors. Elevated GGT levels have been associated with vascular invasion, lymph node involvement, incomplete tumor encapsulation, and advanced ICC based on higher Child–Pugh grading. Furthermore, studies have demonstrated that patients with high GGT levels exhibit a median recurrence-free survival of only 6 months, compared to 12 months in patients with low GGT [28]. Carcinoembryonic antigen (CEA) is a widely used marker for gastrointestinal malignancies; its increased levels are significantly associated with higher tumor burden, metastasis, and worse prognosis in ICC [29]. Additionally, high expression levels of AFP and CEA in tumor tissues are significantly correlated with decreased survival in ICC patients [25]. CA153, a mucin-like glycoprotein broadly expressed in various cancers and linked to tumor cell proliferation and invasion, suggests a higher degree of malignancy; however, its prognostic value in ICC requires further investigation [30]. As a conventional marker for hepatocellular carcinoma, AFP has been reported to correlate with tumor invasiveness, differentiation, and poor prognosis in ICC patients. Elevated AFP is also strongly associated with microvascular invasion (MVI) in ICC [31], and serum AFP levels have been shown to independently predict overall survival (OS) in patients with HBV-related ICC [32]. Lastly, a heterogeneous density observed on plain CT scans is considered a high-risk factor for early ICC recurrence. This heterogeneous density reflects the complex tumor heterogeneity arising from high tumor aggressiveness, which leads to changes such as tumor necrosis and hemorrhage [33].
In this study, radiomic predictors of early ICC recurrence in this study were drawn primarily from first‑order statistics, shape, and texture descriptors. Specifically, first-order statistical features (e.g., Mean, Median, Interquartile Range) indicate variations in the distribution of gray-level intensities within the tumor, reflecting intratumoral density heterogeneity; texture features (e.g., those derived from GLCM, GLRLM, and GLSZM) elucidate the complexity of the tumor’s microstructure and suggest spatial heterogeneity; and shape features (e.g., Elongation) reveal irregularities or invasiveness in tumor growth [34, 35]. Overall, these features collectively capture the imaging characteristics of ICC, which include high-density heterogeneity, intricate internal structure, and irregular growth patterns, thereby showing a close association with the risk of early recurrence.
Early identification of adverse outcomes and the improvement of overall survival in cancer patients have long been key objectives for medical researchers. Thanks to comprehensive analyses of multimodal data, the early detection of high-risk ICC recurrence is now achievable [18]. Currently, methods for integrating multimodal data are generally categorized into early, intermediate, and late fusion strategies. Among these, early (data-level) fusion and feature-level fusion are widely favored because they impose lower requirements on the data and the models [36, 37]. Numerous studies have applied radiomics for predicting early postoperative recurrence in ICC, consistently demonstrating that multimodal combined models offer superior predictive performance. However, most studies tend to focus primarily on predictive accuracy while neglecting model interpretability. In many cases, a strategy of combining features before feature selection (i.e., early fusion) is used; this approach not only risks a dimensionality explosion but also results in significant discrepancies between the features utilized in unimodal models and those incorporated into the multimodal combined model, thereby making it challenging to isolate the individual impact of each unimodal feature. Consequently, this diminishes both the interpretability of the multimodal model and its comparability with unimodal models. The IFSCI strategy employed in this study effectively addresses these issues by first independently selecting features from each modality and then consistently integrating them. This not only enhances model interpretability but also improves comparability between multimodal and unimodal models. Moreover, this strategy provides a clearer foundation for model interpretation and clinical decision-making, thereby increasing clinicians’ trust in complex predictive models.
In terms of machine learning algorithms, we concurrently evaluated the performance of Multilayer Perceptron (MLP), Random Forest (RF), and Support Vector Machine (SVM). The MLP consistently demonstrated superior predictive performance across all models, which may be attributed to its strong nonlinear mapping capabilities and superior generalization performance [38]. Although RF and SVM are widely used in medical image analysis, the robust learning ability of MLP—derived from its artificial neural network architecture—renders it more effective in capturing the latent nonlinear relationships between clinical and radiomics features in complex datasets. In real-world scenarios, data variables often exhibit both linear and nonlinear associations. As a feedforward neural network, MLP consists of an input layer, one or more hidden layers, and an output layer, with each layer comprising multiple neurons; it employs nonlinear activation functions to perform complex feature transformations [39]. However, MLP’s high data requirements necessitate future validation on larger external cohorts to strengthen its generalizability and robustness.
The interpretability of machine learning models is paramount in the medical field, as it directly influences clinicians’ acceptance of predictive outcomes and the models’ clinical translatability. To enhance model interpretability, this study employed SHAP values for model explanation [22]. SHAP’s advantage lies in its game theory-based, post hoc interpretation approach, which can uncover complex patterns learned by predictive models, going beyond the limitations of simpler modeling methods. The integration of SHAP values not only enables visualization of the model prediction process but also provides patient-specific, localized explanations, thereby making it easier for clinicians to understand the rationale behind the model’s outputs. Overall, SHAP values effectively elucidate the workings of machine learning models and enhance their applicability in clinical practice. However, it should be noted that SHAP explains the patterns learned by the model rather than directly interpreting the underlying sample features; as a result, explaining outliers that deviate significantly from the overall trend remains challenging—a difficulty inherent to machine learning predictions. Future studies involving larger sample sizes for model training and the incorporation of deep learning-based interpretation methods may help overcome these limitations.
In summary, this study significantly improved the models’ predictive performance and interpretability by implementing an IFSCI strategy in conjunction with SHAP value integration. This approach elucidated the contribution of individual features within the multimodal framework for predicting early postoperative recurrence in ICC. Among the algorithms evaluated, the MLP model demonstrated superior predictive power, likely owing to its robust nonlinear mapping capability. However, One notable limitation is the study’s relatively small sample size, no comparison was made between different model interpretation methods, and variability in CT scanner acquisition parameters may introduce fluctuations in radiomic feature extraction. Future studies should recruit larger, multicenter cohorts and establish standardized imaging acquisition and preprocessing protocols to minimize heterogeneity, thereby facilitating clinical translation of the model and enabling precise postoperative management and individualized treatment for ICC patients. In addition, clinical variables were selected using univariate statistical testing rather than embedded or wrapper machine-learning methods. This choice was made to maintain clinical interpretability, although it does not account for multivariate interactions. A further limitation concerns the radiomics feature selection strategy. The Mann–Whitney U-test may exclude features that become informative only through multivariate interactions, and the fixed K value in the SelectKBest step may constrain the flexibility of dimensionality reduction. Future studies should incorporate sensitivity analyses using alternative selection approaches and varying K values to assess the robustness of the selected feature set. Future studies with larger cohorts will further explore ML-based feature-selection strategies.
Conclusions
The IFSCI approach combined with SHAP values facilitates the development of highly interpretable multimodal machine learning models for predicting early recurrence following ICC surgery. Compared to traditional unimodal prediction models, the multimodal combined model further enhances predictive performance, with the MLP algorithm demonstrating superior outcomes due to its robust nonlinear processing capabilities. Future research should focus on validation through multicenter studies and large sample cohorts to further promote clinical translation, ultimately advancing precise postoperative management and individualized treatment for ICC patients.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary Material 1
Supplementary Material 2
Supplementary Material 3
Supplementary Material 4
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Valle JW, Kelley RK, Nervi B, Oh DY, Zhu. AX. Biliary tract cancer. Lancet. 2021;397:428–44. 10.1016/S 0140-6736(21)00153-733516341 · doi ↗ · pubmed ↗
- 2Bekki Y, von Ahrens D, Takahashi H, Schwartz M, Gunasekaran G. Recurrent intrahepatic cholangiocarcinoma: review. Front Oncol. 2021;11. 10.3389/fonc.2021.776863 PMC 856713534746017 · doi ↗ · pubmed ↗
- 3Bray F, Laversanne M, Sung H, Ferlay J, Siegel R L, Soerjomataram I, Jemal A. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2024;74:229–63.10.3322/caac.2183438572751 · doi ↗ · pubmed ↗
- 4Abou-Alfa GK, Sahai V, Hollebecque A, Vaccaro G, Melisi D, Al-Rajabi R, Paulson AS, Borad MJ, Gallinson D, Murphy AG, Oh DY, Dotan E, Catenacci DV, Van Cutsem E, Ji T, Lihou CF, Zhen H. Féliz L and Vogel A. Pemigatinib for previously treated, locally advanced or metastatic cholangiocarcinoma: a multicentre, open-label, phase 2 study. Lancet Oncol. 2020;21:671–84. 10.1016/S 1470-2045(20)30109-1PMC 846154132203698 · doi ↗ · pubmed ↗
- 5Zerunian M, Polidori T, Palmeri F, Nardacci S, Del Gaudio A, Masci B, Tremamunno G, Polici M, De Santis D, Pucciarelli F, Laghi A, Caruso D. Artificial intelligence and radiomics in cholangiocarcinoma: a comprehensive review. Diagnostics. 2025;15:148. 10.3390/diagnostics 15020148 PMC 1176377539857033 · doi ↗ · pubmed ↗
- 6Doussot A, Gonen M, Wiggers JK, Groot-Koerkamp B, De Matteo RP, Fuks D, Allen PJ, Farges O, Kingham TP, Regimbeau JM, D’Angelica MI, Azoulay D, Jarnagin WR. Recurrence patterns and disease-free survival after resection of intrahepatic cholangiocarcinoma: preoperative and postoperative prognostic models. J Am Coll Surg. 2016;223:493–505.e 2. 10.1016/j.jamcollsurg.2016.05.019PMC 500365227296525 · doi ↗ · pubmed ↗
- 7Li Q, Zhang J, Chen C, Song T, Qiu Y, Mao X, Wu H, He Y, Cheng Z, Zhai W, Li J, Zhang D, Geng Z, Tang Z. A nomogram model to predict early recurrence of patients with intrahepatic cholangiocarcinoma for adjuvant chemotherapy guidance: a multi-institutional analysis. Front Oncol. 2022;12:896764. 10.3389/fonc.2022.896764 PMC 925998435814440 · doi ↗ · pubmed ↗
- 8Yoh T, Hatano E, Seo S, Okuda Y, Fuji H, Ikeno Y, Taura K, Yasuchika K, Okajima H, Kaido T, Uemoto S. Long-term survival of recurrent intrahepatic cholangiocarcinoma: the impact and selection of repeat surgery. World J Surg. 2018;42:1848–56. 10.1007/s 00268-017-4387-729218465 · doi ↗ · pubmed ↗
