Development and Validation of the Early Gastric Carcinoma Prediction Model in Post-Eradication Patients with Intestinal Metaplasia
Wulian Lin, Guanpo Zhang, Hong Chen, Weidong Huang, Guilin Xu, Yunmeng Zheng, Chao Gao, Jin Zheng, Dazhou Li, Wen Wang

TL;DR
This study developed a machine learning model to predict early stomach cancer risk in patients who have been treated for a common stomach bacterium but still have abnormal stomach lining changes.
Contribution
The study introduces a validated machine learning model and web-based tool for predicting early gastric cancer in post-eradication patients with intestinal metaplasia.
Findings
The CatBoost algorithm achieved high accuracy in predicting early gastric cancer with an AUC of 0.905 in external validation.
The model outperformed traditional inflammatory biomarkers like NLR and PLR in risk discrimination.
A web-based calculator was developed to help doctors assess patient risk and improve early detection.
Abstract
Gastric cancer is one of the leading causes of cancer deaths worldwide. Although a common stomach bacterium can be treated with medicine, some patients still develop cancer even after treatment. This is especially true for people whose stomach lining has already changed in harmful ways. In this study, we used computer models to analyze medical records and endoscopy images from two hospitals to find patterns that might predict who is more likely to develop early stomach cancer. We created a simple online tool that doctors can use to calculate a patient’s risk. This can help identify high-risk patients earlier and make sure they receive the right follow-up care. Our goal is to improve early detection and save lives through better screening. Background: Gastric cancer (GC) remains a major global health challenge, with rising incidence among patients post-Helicobacter pylori (H. pylori)…
Figures 1–10 (images not reproduced here)
Funding
- Fujian Science and Technology Innovation Joint Funding Programme
- Major Science and Technology Project of Fujian Province
Taxonomy
Topics: Gastric Cancer Management and Outcomes · Helicobacter pylori-related gastroenterology studies · Cancer-related molecular mechanisms research
1. Introduction
Gastric cancer (GC) remains a major global health concern, ranking fifth in cancer incidence and fourth in cancer-related mortality worldwide. In 2020, more than one million new cases were diagnosed, leading to over 769,000 deaths [1,2]. While age-standardized incidence rates are declining in several high-risk regions such as East Asia, demographic transitions forecast a substantial rise in disease burden, with projected increases of 72.2% in incidence and 75.9% in mortality in Asia by 2040 [3,4]. Early detection is pivotal—five-year survival exceeds 90% for early gastric cancer (EGC) but drops below 30% once the disease is advanced [5]. Unfortunately, late diagnosis remains the norm in many settings; for example, in China, approximately 80% of patients are diagnosed at a locally advanced stage [6].
Persistent Helicobacter pylori (H. pylori) infection is the key driver of GC, initiating a cascade of chronic inflammation, atrophic gastritis, intestinal metaplasia (IM), and neoplasia [7]. The World Health Organization has classified H. pylori as a Group 1 carcinogen since 1994 due to its strong oncogenic potential. Eradication therapy has been shown to significantly reduce the risk of GC, with large-scale randomized trials and meta-analyses reporting a 39% relative risk reduction (RR = 0.61; 95% CI: 0.47–0.79) [8]. Despite this benefit, cancer risk is not eliminated after eradication, and EGC continues to occur, particularly in patients with pre-existing mucosal damage such as severe atrophy or IM [9]. Furthermore, H. pylori eradication is often delayed in clinical practice, leaving ample time for irreversible histological progression to occur before treatment is initiated. Surveillance strategies for this growing post-eradication population remain suboptimal, and risk prediction remains poorly defined [10,11].
Although endoscopic screening is the cornerstone of EGC detection, its sensitivity in the post-eradication setting is limited. Conventional white-light imaging often fails to identify subtle premalignant lesions, especially in patients with extensive IM or corpus-predominant atrophy. In a prospective multicenter study, 30.1% of patients developed map-like redness (MLR), a surrogate for underlying IM, within one year of H. pylori eradication, predominantly in the corpus. Importantly, these lesions corresponded to areas of histologic IM that predated eradication, suggesting that endoscopic findings may lag behind histologic progression. High Kyoto classification scores and severe IM were strong predictors of these post-eradication abnormalities (OR = 8.144; 95% CI: 2.811–23.592) [12].
In parallel, machine learning (ML) techniques have been increasingly applied to cancer risk modeling. A Korean nationwide study involving over 10 million individuals used SHAP-based interpretation to identify key risk factors for gastric cancer; however, model performance remained modest, with AUCs of 0.708 in internal and 0.669 in external validation [13]. Another study employing LASSO-XGBoost achieved high accuracy (AUC = 0.8937), but focused primarily on advanced-stage disease and postoperative survival, limiting its clinical applicability for early detection [14]. Moreover, most existing models rely on coarse clinical features and seldom incorporate endoscopic or histologic markers that are crucial for risk stratification in post-eradication populations.
Given these limitations, we conducted a dual-center retrospective study including patients from both the 900th Hospital of the PLA Joint Logistic Support Force and Fujian Provincial People’s Hospital, aiming to develop and validate a machine learning-based model for predicting EGC in patients with IM following H. pylori eradication. By comparing its performance with conventional inflammatory and nutritional indices, and deploying it as a web-based risk calculator, we sought to improve individualized surveillance and facilitate EGC detection in this vulnerable population.
2. Materials and Methods
2.1. Study Design and Patient Population
This retrospective cohort study was conducted at two tertiary medical centers in China: the 900th Hospital of the PLA Joint Logistic Support Force and Fujian Provincial People’s Hospital. Clinical and endoscopic data were obtained from institutional endoscopy registries and electronic medical records spanning from January 2019 to December 2024. The study protocol was approved by the Institutional Review Board (IRB Number: Lun Shen Ke 2024-063), and written informed consent was obtained from all participants for the use of their clinical data for research purposes. The study adhered to the principles of the Declaration of Helsinki and followed the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) guidelines for reporting.
All enrolled patients received a standardized 14-day quadruple therapy regimen for Helicobacter pylori eradication. The regimen included esomeprazole 20 mg twice daily, clarithromycin 500 mg twice daily, amoxicillin 1 g twice daily, and colloidal bismuth pectin 200 mg three times daily, administered orally. This protocol is in accordance with current national guidelines for H. pylori eradication and has been widely validated in clinical practice. Treatment adherence and post-treatment eradication status were confirmed by a urea breath test performed at least four weeks after therapy completion.
Patients were eligible for inclusion if they: (1) had documented successful H. pylori eradication confirmed by a negative urea breath test, stool antigen test, or histological examination at least 6 months prior to enrollment; (2) had histologically confirmed intestinal metaplasia in at least one gastric biopsy specimen; and (3) underwent comprehensive endoscopic examination with standardized imaging and a systematic biopsy protocol. Exclusion criteria comprised: (1) history of gastrectomy or endoscopic resection; (2) concurrent malignancy at any site; (3) severe comorbidities precluding endoscopic surveillance (ASA class ≥ III); (4) use of anticoagulants or antiplatelet agents that could not be discontinued before endoscopy; (5) incomplete medical records; and (6) poor-quality endoscopic images unsuitable for standardized assessment (Figure 1).
2.2. Data Collection
Comprehensive demographic, clinical, endoscopic, and laboratory data were extracted from the institutional electronic medical records system. Two experienced gastroenterologists (Lin Wulian and Li Dazhou), each with more than 10 years of experience in advanced endoscopy, independently reviewed all endoscopic images and reports. Any discrepancies were resolved through consensus with a third senior endoscopist (Wang Wen).
2.3. Demographic and Clinical Variables
Demographic data included age, sex, height, weight, and calculated body mass index (BMI). Clinical history variables encompassed family history of gastric cancer (first-degree relatives), smoking status, alcohol consumption, and history of H. pylori eradication therapy (regimen and duration documented).
2.4. Endoscopic Assessment
All endoscopic examinations were performed by certified endoscopists using high-definition white-light endoscopy and LCI/BLI (ELUXEO 7000 system; Fujifilm, Tokyo, Japan, or equivalent). The extent of atrophic gastritis was evaluated using the Kimura-Takemoto classification system, which categorizes atrophy into closed-type (C-1, C-2, C-3) and open-type (O-1, O-2, O-3) based on the location of the atrophic border. The atrophy range was then numerically converted (1–6) for statistical analysis, with higher scores indicating more extensive atrophy.
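The numeric conversion of the atrophy range can be sketched as a simple lookup. This is a minimal illustration, assuming the ordering C-1 = 1 through O-3 = 6 (the text states only that the scale runs 1–6 with higher scores meaning more extensive atrophy); the function name `atrophy_range` is hypothetical.

```python
# Kimura-Takemoto grades ordered from least to most extensive atrophy.
# Assumption: C-1 maps to 1 and O-3 maps to 6; the paper gives only the 1-6 range.
KT_GRADES = ("C-1", "C-2", "C-3", "O-1", "O-2", "O-3")
KT_SCORE = {grade: score for score, grade in enumerate(KT_GRADES, start=1)}


def atrophy_range(grade: str) -> int:
    """Map a Kimura-Takemoto grade such as 'C-2' to an ordinal score from 1 to 6."""
    key = grade.strip().upper()
    if key not in KT_SCORE:
        raise ValueError(f"unknown Kimura-Takemoto grade: {grade!r}")
    return KT_SCORE[key]
```

Under this assumed ordering, the open-type grades occupy scores 4 to 6, so a reported median atrophy range of 3 would correspond to C-3.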
MLR, a characteristic post-eradication finding defined as well-demarcated reddish lesions with irregular margins resembling a geographical map, was documented for presence (yes/no), MLR range (percentage of gastric mucosa affected), and maximum size (in centimeters). The maximum MLR size exceeding 2 cm (maximumMLR2cm) was specifically recorded as a binary variable based on previous literature suggesting its potential predictive value. Xanthoma presence was defined as raised yellowish-white plaques on endoscopic examination. Reflux esophagitis (RE) was graded according to the Los Angeles classification system.
Standardized biopsy protocol followed the Sydney System guidelines with five biopsy sites (antrum greater and lesser curvature, incisura angularis, corpus greater and lesser curvature) plus targeted biopsies of any suspicious lesions. EGC was defined as adenocarcinoma confined to the mucosa or submucosa, irrespective of lymph node status, and was confirmed by two independent pathologists specializing in gastrointestinal malignancies.
2.5. Laboratory Parameters and Inflammatory Indices
Blood samples were collected prior to endoscopy. Complete blood count, comprehensive metabolic panel, coagulation profile, and tumor markers were measured using standardized laboratory methods. Specific hematological parameters included neutrophil count, lymphocyte count, platelet count, hemoglobin, albumin level, absolute monocyte count (AMC), absolute lymphocyte count (ALC), red cell distribution width (RDW), and prothrombin time (PT). Tumor markers were recorded as follows: carcinoembryonic antigen (CEA), carbohydrate antigen 19-9 (CA19-9), and carbohydrate antigen 72-4 (CA72-4).
The following inflammatory and nutritional indices were calculated:
- Neutrophil-to-lymphocyte ratio (NLR) [15] = neutrophil count/lymphocyte count
- Platelet-to-lymphocyte ratio (PLR) [15] = platelet count/lymphocyte count
- Lymphocyte-to-monocyte ratio (LMR) [15] = lymphocyte count/monocyte count
- Prognostic nutritional index (PNI) [15] = 10 × serum albumin (g/dL) + 0.005 × total lymphocyte count (per mm³)
- Systemic immune-inflammation index (SII) [15] = platelet count × neutrophil count/lymphocyte count
- Systemic inflammation response index (SIRI) [15] = neutrophil count × monocyte count/lymphocyte count
- Geriatric nutritional risk index (GNRI) [15] = 1.489 × albumin (g/L) + 41.7 × (weight/ideal weight)
- Hemoglobin, albumin, lymphocyte, and platelet score (HALP) [16] = hemoglobin (g/L) × albumin (g/L) × lymphocyte count/platelet count
- Platelet-to-albumin ratio (PAR) [16] = platelet count/serum albumin (g/L)
These indices were selected based on previous literature demonstrating their association with inflammation, nutritional status, and cancer risk in upper gastrointestinal disorders. All indices were calculated using laboratory values obtained during the same sampling period to ensure internal consistency.
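The index definitions above can be written out as a short calculation. The sketch below is illustrative only: the `Labs` container, its field names, and the input units (cell counts in 10^9/L, albumin and hemoglobin in g/L) are assumptions, since the paper does not state the units in which raw values were stored. The PNI line converts to the g/dL and per-mm³ units that its formula requires.

```python
from dataclasses import dataclass


@dataclass
class Labs:
    """Hypothetical container; units are assumptions, not taken from the paper."""
    neutrophils: float      # 10^9/L
    lymphocytes: float      # 10^9/L
    monocytes: float        # 10^9/L
    platelets: float        # 10^9/L
    hemoglobin_g_l: float   # g/L
    albumin_g_l: float      # g/L
    weight_kg: float
    ideal_weight_kg: float


def indices(x: Labs) -> dict:
    """Compute the nine inflammatory and nutritional indices listed in Section 2.5."""
    return {
        "NLR": x.neutrophils / x.lymphocytes,
        "PLR": x.platelets / x.lymphocytes,
        "LMR": x.lymphocytes / x.monocytes,
        # PNI needs albumin in g/dL (g/L divided by 10) and lymphocytes
        # per mm^3 (10^9/L multiplied by 1000).
        "PNI": 10 * (x.albumin_g_l / 10) + 0.005 * (x.lymphocytes * 1000),
        "SII": x.platelets * x.neutrophils / x.lymphocytes,
        "SIRI": x.neutrophils * x.monocytes / x.lymphocytes,
        "GNRI": 1.489 * x.albumin_g_l + 41.7 * (x.weight_kg / x.ideal_weight_kg),
        "HALP": x.hemoglobin_g_l * x.albumin_g_l * x.lymphocytes / x.platelets,
        "PAR": x.platelets / x.albumin_g_l,
    }
```

Because several indices divide by the lymphocyte, monocyte, or platelet count, a zero count would raise `ZeroDivisionError`; how such records were handled is not described in the text.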
2.6. Feature Selection and Engineering
Feature selection was performed using a multi-stage approach to identify the most predictive variables while minimizing multicollinearity. Initially, univariate analysis assessed the association between each potential predictor and the presence of EGC. Variables with p < 0.1 in univariate analysis were considered for further evaluation.
Correlation analysis was performed using Spearman’s rank correlation coefficient for continuous variables, with pairs showing |r| > 0.7 considered highly correlated. For highly correlated feature pairs, the variable with stronger univariate association with the outcome was retained. Boruta was then applied to identify the optimal feature subset, using the area under the receiver operating characteristic curve (AUC-ROC) as the performance metric.
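The correlation filter described above can be sketched in plain Python. This is one possible reading, not the authors' code: it ranks features by an assumed univariate strength score (higher means a stronger association with EGC) and greedily keeps a feature only if its absolute Spearman correlation with every already-kept feature is at most 0.7. The names `drop_correlated` and `outcome_strength` are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(a, b):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def drop_correlated(features, outcome_strength, threshold=0.7):
    """features: {name: values}; outcome_strength: {name: score}.
    Visit features from strongest to weakest and keep each one only if it is
    not highly correlated with any feature that has already been kept."""
    kept = []
    for name in sorted(features, key=outcome_strength.get, reverse=True):
        if all(abs(spearman(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

A constant feature has zero rank variance and would cause a division by zero in `pearson`, so such columns would need to be removed beforehand.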
Additionally, SHapley Additive exPlanations (SHAP) values were calculated to quantify the contribution of each selected feature to the model output. This approach allowed us to rank features based on their absolute SHAP values and select the optimal feature subset.
2.7. Model Development and Validation
The dataset was randomly split into training (70%) and validation (30%) sets, stratified by the presence of EGC to maintain class distribution. To address potential class imbalance, we employed the Synthetic Minority Over-sampling Technique (SMOTE) on the training set only, creating synthetic instances of the minority class to achieve balanced class distribution.
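To make the oversampling step concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic row is an interpolation between a minority-class row and one of its k nearest minority-class neighbours. It is a simplified illustration rather than the implementation used in the study; the function name, the default `k`, and the seed are assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows from a list of minority-class rows.
    Requires at least two minority rows. Apply to training data only."""
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance between two rows.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen row (excluding itself).
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: dist2(base, m),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([x + gap * (y - x) for x, y in zip(base, other)])
    return synthetic
```

Interpolation produces fractional values, so binary or ordinal predictors (for example xanthoma or atrophy range) would need rounding or a categorical-aware variant; the text does not say how this was handled.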
We systematically evaluated 21 ML algorithms with varying computational approaches and complexity levels, including:
- Tree-based methods: CatBoost, LightGBM, Random Forest, Extra Trees, Gradient Boosting, Decision Tree
- Ensemble methods: Bagging, AdaBoost
- Support vector machines: SVC (Polynomial kernel), SVC (Radial Basis Function kernel), Linear SVC
- Bayesian methods: Gaussian Naive Bayes, Bernoulli Naive Bayes
- Linear and discriminant-analysis models: Quadratic Discriminant Analysis (QDA), Linear Discriminant Analysis (LDA), Ridge Classifier, Logistic Regression, Stochastic Gradient Descent (SGD) Classifier
- Neural networks: Multi-Layer Perceptron (MLP)
- Instance-based methods: K-Nearest Neighbors (k = 3), K-Nearest Neighbors (k = 5)
Hyperparameter optimization was conducted using Bayesian optimization with five-fold cross-validation, allowing 100 iterations to identify optimal parameter configurations for each algorithm. The search space for hyperparameters was defined based on established literature and computational constraints.
Model performance was evaluated using the AUC-ROC, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy. Additionally, the area under the precision–recall curve (AUC-PR) was calculated to account for potential class imbalance. Confidence intervals (95% CI) for all metrics were derived using 1000 bootstrap resamples.
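The bootstrap confidence intervals mentioned above can be illustrated for the AUC as follows. This sketch assumes a percentile bootstrap (the text does not say which bootstrap interval was used) and computes the AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case; the function names are hypothetical.

```python
import random


def auc(y_true, scores):
    """AUC via pairwise comparison of positives and negatives; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, scores, metric=auc, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric of (labels, scores)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample with one class has no defined AUC
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

The same `bootstrap_ci` helper could be reused for sensitivity, specificity, or any other metric by passing a different `metric` function with the same two-argument signature.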
2.8. Model Comparison and Ensemble Creation
The best-performing algorithm based on validation set AUC-ROC was selected as the primary prediction model, designated as the Early Gastric Cancer Model (EGCM). For comprehensive comparison, we developed nine simplified models based on established inflammatory and nutritional indices: (1) NLR model, (2) PNI model, (3) PLR model, (4) SII model, (5) SIRI model, (6) GNRI model, (7) HALP model, (8) LMR model, and (9) PAR model. These comparisons were performed using DeLong’s test for correlated ROC curves.
2.9. Calibration and Decision Curve Analysis
Model calibration was assessed using calibration plots, comparing predicted probabilities against observed event rates across deciles of predicted risk. The Hosmer–Lemeshow test and Brier score were calculated to quantify calibration quality. Additionally, decision curve analysis (DCA) was performed to evaluate the clinical utility of the model across a range of decision thresholds, measuring the net benefit of using the model compared to strategies of treating all patients or no patients.
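Two of the quantities above have compact definitions that are easy to state in code. The sketch below computes the Brier score (mean squared difference between predicted probability and outcome) and the net benefit used in decision curve analysis, TP/n − (FP/n) × pt/(1 − pt) at threshold probability pt, together with the treat-all reference strategy. Function names are illustrative.

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def net_benefit(y_true, probs, threshold):
    """Net benefit of acting on patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: every patient is treated as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)
```

The treat-none strategy has a net benefit of zero at every threshold, so a decision curve is obtained by evaluating `net_benefit` over a grid of thresholds and plotting it against these two references.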
2.10. Interpretability Analysis
To enhance clinical interpretability, SHAP summary and dependence plots were created to illustrate both global and local feature importance.
2.11. Web-Based Calculator Development
A user-friendly web-based calculator was developed using the Flask framework (Python) with a responsive HTML frontend. The application was designed to accept input of the key predictive variables identified by the model, process them using the pre-trained prediction algorithm, and return the estimated probability of EGC. The calculator incorporates data validation, normalization according to the training dataset parameters, and visualization of individual risk factors. The web application was deployed on a secure server with SSL encryption to ensure data privacy and underwent usability testing with a panel of five experienced gastroenterologists.
2.12. Statistical Analysis
All statistical analyses were performed using R version 4.2.0 (R Foundation for Statistical Computing, Vienna, Austria) and Python 3.9 with scikit-learn 1.0.2, XGBoost 1.5.1, and SHAP 0.40.0 packages. Continuous variables were reported as mean ± standard deviation or median (interquartile range) based on distribution normality assessed by the Shapiro–Wilk test. Categorical variables were presented as counts and percentages. Between-group comparisons were conducted using Student’s t-test or Mann–Whitney U test for continuous variables and the chi-square test or Fisher’s exact test for categorical variables, as appropriate.
For comparative analysis of inflammatory indices, we constructed receiver operating characteristic (ROC) curves for each index and calculated the corresponding AUC values with 95% confidence intervals. Pairwise comparisons of AUC values were performed using DeLong’s method with Bonferroni correction for multiple comparisons.
All statistical tests were two-sided, with p < 0.05 considered statistically significant. p-values were adjusted for multiple comparisons using the Benjamini–Hochberg procedure to control the false discovery rate.
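For reference, the Benjamini–Hochberg adjustment can be sketched in a few lines. The function below returns adjusted p-values in the original order; it is a generic illustration of the procedure, not the authors' code.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A hypothesis is then declared significant when its adjusted p-value falls below the chosen false discovery rate, here 0.05.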
3. Results
3.1. Basic Information of All Patients
A total of 214 patients were included in this multicenter cohort, comprising 126 non-EGC and 88 EGC cases. The dataset was divided into internal training (n = 107), internal testing (n = 44), and external testing (n = 63) sets. Across all cohorts, 77.1% were male, and the median age was 62 years. Key endoscopic features such as map-like redness (MLR) and xanthoma were observed in 61.7% and 35% of patients, respectively. The atrophy range had a median value of 3 [IQR: 2–4], while the MLR range was 2 [IQR: 0–3]. The median H. pylori eradication time was 3 years [IQR: 2–7]. There were no statistically significant differences (all p > 0.05) among the internal training, internal testing, and external testing cohorts in any demographic, endoscopic, or laboratory variables, indicating comparability across the datasets. Full variable distributions are detailed in Table 1.
3.2. Information in Training Set
In our analysis of variables associated with EGC in the training set (Table 2), several factors differed significantly between patients with and without EGC. Demographically, males accounted for a significantly larger share of EGC cases than females (88.9% vs. 11.1%, p = 0.047), and affected individuals were notably older (66.71 ± 7.64 vs. 56.69 ± 10.48 years, p < 0.001). Endoscopic findings revealed particularly strong associations, with MLR present in 97.8% of EGC cases versus 40.3% in the non-cancer group (p < 0.001), and MLR exceeding 2 cm in diameter more frequently observed in the cancer group (57.8% vs. 22.6%, p < 0.001). Similarly, xanthoma was significantly more prevalent among EGC patients (66.7% vs. 16.1%, p < 0.001), while quantitative endoscopic parameters including atrophy range (4.29 ± 1.16 vs. 2.53 ± 0.86, p < 0.001) and MLR range (2.93 ± 1.23 vs. 1.02 ± 1.29, p < 0.001) were markedly elevated. These findings suggest that specific clinical, endoscopic, and serological parameters may serve as valuable indicators for EGC detection in clinical practice. In the internal training cohort, we also recorded the time interval between H. pylori eradication and the date of endoscopic diagnosis, referred to as eradication time. Comparative analysis revealed no statistically significant difference in eradication time between the EGC and non-EGC groups (t-test, p = 0.057), suggesting that eradication duration alone may not be a strong predictor of early gastric cancer in this selected population.
3.3. Machine Learning Model Performance in EGC Discrimination
In our evaluation of ML models for EGC prediction, the top-performing algorithms demonstrated moderate to good discriminative ability. Among the 21 models tested, CatBoost ranked first, with the two top-ranked models sharing an identical test AUC of 0.754. LightGBM ranked third with an AUC of 0.741 and sensitivity matching CatBoost (0.722) but with lower specificity (0.615). Bagging and AdaBoost classifiers followed closely with AUCs of 0.739 and 0.737, respectively, both achieving identical performance metrics (accuracy: 0.704, sensitivity: 0.666, specificity: 0.730, PPV: 0.631, NPV: 0.76). Notably, while CatBoost demonstrated the best balance between sensitivity and specificity, ensemble methods generally outperformed individual classifiers, suggesting that combining multiple learners may enhance the predictive capability for EGC detection. The relatively modest AUC values across all models indicate that further refinement of feature selection and algorithm optimization may be necessary for clinical implementation (Table 3 and Figure 2).
3.4. Feature Selection
Variable importance was assessed using the Boruta algorithm on the top 20 ranked variables. The ten most important variables identified (atrophy range, xanthoma, MLR, MLR range, age, maximum MLR > 2 cm, CA72-4, hemoglobin, male sex, and CEA) are depicted in Figure 3A. Furthermore, SHAP analysis was performed to visualize the contribution of each variable to the model’s output (Figure 3B). To determine the optimal number of variables for model construction, we evaluated the AUC values across models with varying numbers of predictors (Figure 3C). Although the model including nine variables exhibited a slightly higher AUC, we ultimately selected a five-variable model comprising atrophy range, xanthoma, MLR, MLR range, and age. This decision was based on a balance between predictive performance and clinical applicability. The five-variable model achieved satisfactory discriminative power while maintaining parsimony and interpretability, which are essential for real-world implementation. These variables were also consistently ranked among the most important features by both Boruta and SHAP analyses. Therefore, the final predictive model for EGC, termed the Early Gastric Cancer Model (EGCM), was constructed using these five features. Representative endoscopic findings for MLR and xanthoma are illustrated in Figure 4.
3.5. ROC
To evaluate the diagnostic performance of the proposed EGCM, we compared its classification metrics with those of conventional inflammation- and nutrition-based indices across the training, internal validation, and external validation cohorts (Table 4). In the training set, EGCM demonstrated outstanding discrimination with an AUC of 0.943 (95% CI: 0.900–0.987), sensitivity of 0.888, specificity of 0.919, and accuracy of 0.906. In contrast, all comparative indices (NLR, PLR, SII, SIRI, LMR, PAR) exhibited poor performance, with AUCs close to 0.5, sensitivities of 0, and specificities of 1.0, reflecting high false-negative rates. Among them, GNRI performed relatively better (AUC: 0.597; 95% CI: 0.504–0.690), albeit with a sensitivity of only 0.155. In the internal test set, EGCM retained strong performance, achieving an AUC of 0.743 (95% CI: 0.614–0.872), with sensitivity, specificity, and accuracy of 0.555, 0.692, and 0.636, respectively. Among traditional indices, only PNI (AUC: 0.626), SIRI (AUC: 0.613), and GNRI (AUC: 0.559) showed marginal predictive capacity, whereas the remaining models had AUCs < 0.53 and zero sensitivity. Importantly, in the external validation cohort (Fujian Provincial People’s Hospital), EGCM achieved a high AUC of 0.905 (95% CI: 0.832–0.977), with sensitivity of 0.76, specificity of 0.894, and accuracy of 0.841, demonstrating strong generalizability across independent populations. In contrast, all traditional indices failed to exceed AUCs of 0.57, and again showed poor sensitivity (0 for most). These results, visualized in Figure 5, underscore the superior discriminative ability and clinical utility of the EGCM compared to established indices across multiple datasets.
3.6. PR Curve
To further evaluate the discriminative performance of the EGCM, precision–recall (PR) curves were analyzed across the training, internal validation, and external validation cohorts (Figure 6A–C). In the training set (Figure 6A), the EGCM demonstrated excellent classification performance, characterized by a high and well-shaped PR curve. This reflects the model’s strong ability to maintain high precision while achieving robust recall in identifying early gastric cancer (EGC) cases. In comparison, all conventional inflammation- and nutrition-based indices (e.g., NLR, PLR, SII, SIRI, GNRI, PNI, LMR, HALP, PAR) showed PR curves that remained close to the baseline, indicating limited ability to detect true positives and a tendency toward high false-negative rates. In the internal validation set (Figure 6B), the EGCM maintained a markedly superior PR curve, though with a moderate attenuation in performance compared to the training set—an expected outcome during validation. The model still achieved a favorable balance between precision and recall, supporting its reliability and generalizability within the same institutional cohort. Crucially, in the external validation cohort (Figure 6C), the EGCM retained notably better discriminative performance than all other indices. The PR curve remained well above those of traditional models, reaffirming the EGCM’s robust predictive utility in an independent patient population. Conventional indices, once again, failed to contribute meaningfully, with their PR curves exhibiting near-baseline performance.
3.7. Calibration Curve
In addition to discrimination, calibration analysis was performed to evaluate the agreement between predicted probabilities and observed outcomes across three datasets. As shown in Figure 7 and Table 5, the EGCM exhibited excellent calibration in both the internal and external validation cohorts. In the training set, EGCM achieved an exceptionally low Brier score of 0.001, indicating minimal prediction error, and a Hosmer–Lemeshow (HL) p-value of 0.999, suggesting an excellent goodness-of-fit. The calibration curve (Figure 7A) closely aligned with the 45-degree reference line, visually confirming the strong concordance between predicted and actual risks. In the internal test set, the model maintained good calibration performance with a Brier score of 0.002 and a non-significant HL p-value of 0.261, supporting its generalizability within the same center. The calibration plot (Figure 7B) showed consistent alignment with the ideal line. Importantly, the EGCM also demonstrated reliable performance in the external test cohort, with a Brier score of 0.038 and HL p-value of 0.285, indicating acceptable calibration even in an independent patient population. The calibration plot in Figure 7C confirmed the model’s robustness across centers. In contrast, all comparative indices—including NLR, PNI, PLR, SII, SIRI, GNRI, HALP, LMR, and PAR—exhibited significantly worse calibration metrics. These models had substantially higher Brier scores (up to 0.461) and statistically significant HL p-values (<0.001) in both internal and external validation, indicating poor predictive reliability. Their calibration curves deviated markedly from the reference line, reflecting poor agreement between predicted and actual risks.
3.8. DCA
To assess the clinical utility of the Early Gastric Cancer Model (EGCM), we performed decision curve analysis (DCA) alongside reclassification metrics across three cohorts: internal training, internal validation, and external validation sets. As shown in Figure 8 and Table 6, the EGCM consistently demonstrated the highest net clinical benefit across a wide range of threshold probabilities in all datasets. In the training cohort (Figure 8A), EGCM outperformed all conventional inflammatory and nutritional indices, including NLR, PNI, PLR, SII, SIRI, GNRI, HALP, LMR, and PAR. The DCA curve of EGCM remained distinctly above both the “treat-all” and “treat-none” strategies. Quantitatively, all comparator models showed significantly negative NRI values (e.g., NLR: −0.808, PNI: −0.808) and IDI values (e.g., GNRI: −0.571, HALP: −0.590) with p-values < 0.001, indicating inferior reclassification and discrimination performance. In the internal validation cohort (Figure 8B), EGCM again exhibited superior net benefit compared to other models, although the performance margin narrowed. All indices showed consistently negative IDI values (range: −0.243 to −0.263), and most NRI values were also negative, confirming suboptimal risk classification (e.g., PLR: NRI = −0.247, p = 0.0768). Notably, in the external validation cohort, EGCM maintained its clinical advantage. As shown in Figure 8C, its DCA curve remained highest across clinically relevant thresholds. All alternative indices showed significantly negative NRI values (e.g., NLR: −0.654, PNI: −0.654) and IDI values (e.g., GNRI: −0.460, HALP: −0.478) with p < 0.001, reaffirming the consistent superiority of EGCM in external testing.
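As a reading aid for the IDI values reported above, the integrated discrimination improvement can be sketched as the difference in discrimination slopes between two models, where the discrimination slope is the mean predicted risk among events minus the mean predicted risk among non-events. The sketch assumes each comparator index was evaluated against the EGCM as reference, which would explain the negative signs; the exact NRI and IDI variants used in the study are not stated.

```python
def discrimination_slope(y_true, probs):
    """Mean predicted risk in events minus mean predicted risk in non-events."""
    events = [p for y, p in zip(y_true, probs) if y == 1]
    non_events = [p for y, p in zip(y_true, probs) if y == 0]
    return sum(events) / len(events) - sum(non_events) / len(non_events)


def idi(y_true, probs_model, probs_reference):
    """IDI of a model relative to a reference; negative means the model separates
    events from non-events less well than the reference."""
    return discrimination_slope(y_true, probs_model) - discrimination_slope(
        y_true, probs_reference
    )
```

For example, an index whose predicted risks are the same for every patient has a discrimination slope of zero, so its IDI against a well-separating reference equals minus the reference's slope.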
3.9. Presentation of Various Predictive Scenarios
Figure 9 presents a detailed analysis of different patient cases using feature values and model outputs, offering insights into the performance of a predictive model in the context of EGC. Each sub-figure represents a distinct patient scenario, highlighting the relationship between feature values and the model’s predicted outcome. In Figure 9A, we have a True Negative case. The patient’s feature values are provided, such as “range:3”, “xanthoma:0”, “MLR:1”, “MLRrange:3”, and “age:62”. The model output f(x) = −0.445, and the equation for calculating the model output seems to be a weighted sum of feature values, for example, terms like “1 = MLR + 0.76”, “3 = MLRrange + 0.39”. The expected value E[f(X)] = −1.006, and the relatively negative value of f(x) indicates that the model correctly predicts this case as negative, aligning with the actual outcome. Figure 9B shows a False Negative case. Here, with feature values like “range’:2”, “xanthoma’ 0”, “MLR:1”, “MLRrange:2”, and “age:61”, the model output f(x) = −1.824. Although the model classifies this as a negative case, it is actually a positive case in reality. The large negative value of f(x) compared to the True Negative case in sub-figure A might be due to the combination of feature values, suggesting that the model fails to accurately identify this positive case. Figure 9C represents a False Positive case. The patient has feature values “range: 5”, “anthoma”: 0”, “MLR:1”, “MLRrange:5”, and “age:65”, and the model output f(x) = 2.365. The positive value of f(x) leads the model to predict this as a positive case, while it is actually negative. The large positive value of f(x) could be a result of the specific combination of feature values, causing the model to misclassify. Figure 9D is another True Negative case with feature values “range:3”, “xanthoma: 0”, “MLR: 1”, “MLRrange’:3”, and “age:62”, similar to sub-figure A. The model output f(x) = −0.445 and E[f(X)] = −1.006, correctly indicating a negative outcome. 
Overall, Figure 9 shows how the model's output varies across patient cases according to their feature values. The False Negative and False Positive cases highlight areas where the model can be improved, while the True Negative cases illustrate correct classification. Understanding these differences can help refine the predictive model for more accurate EGC diagnosis.
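As an illustrative aside, the per-case explanations in Figure 9 can be read through the additivity property of SHAP values, under which a model output equals the expected value plus the sum of the per-feature contributions. The Python sketch below applies that identity to the numbers quoted for Figure 9A. It assumes that the quoted terms are waterfall-plot contributions and that f(x) is expressed in log-odds, as is typical for SHAP explanations of gradient-boosted classifiers; neither point is stated explicitly in the text, and all variable names are illustrative.

```python
import math


def sigmoid(log_odds: float) -> float:
    """Convert a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Values quoted in the text for Figure 9A.
BASELINE = -1.006            # E[f(X)]
F_X_FIGURE_9A = -0.445       # f(x)
QUOTED_CONTRIBUTIONS = {"MLR": 0.76, "MLRrange": 0.39}

# SHAP additivity: f(x) = E[f(X)] + sum of per-feature contributions,
# so the total shift from baseline is f(x) - E[f(X)].
total_shift = F_X_FIGURE_9A - BASELINE

# Whatever is not covered by the two quoted terms must come from the
# features that the text does not itemize (range, xanthoma, age).
remaining_shift = total_shift - sum(QUOTED_CONTRIBUTIONS.values())

# Assumption: f(x) is on the log-odds scale, so sigmoid yields a probability.
CASE_SCORES = {"9A": -0.445, "9B": -1.824, "9C": 2.365}
case_probabilities = {name: sigmoid(score) for name, score in CASE_SCORES.items()}

print(f"total shift from baseline: {total_shift:+.3f}")
print(f"shift from non-itemized features: {remaining_shift:+.3f}")
for name, probability in case_probabilities.items():
    print(f"Figure {name}: estimated probability {probability:.3f}")
```

Under these assumptions, the contributions not itemized in the text (range, xanthoma, and age combined) must be negative for Figure 9A, because the two quoted terms alone exceed the total shift from baseline.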
3.10. Web Calculator
A web-based calculator was developed to streamline EGC risk assessment; its interface is presented in Figure 10. The calculator, accessible at https://ktdi3dqqj68uwpu4x9odw9.streamlit.app/ (accessed on 2 June 2025), estimates the probability of early gastric carcinoma from five inputs. Figure 10A shows the default view, with input fields for “Range Value”, “MLR Range Value”, “MLR”, “Age (years)”, and “Xanthoma Present”. Initially, the numerical fields are set to default values such as 0 and the “Xanthoma Present” option is set to “No”, so that users such as medical professionals or researchers can enter patient-specific data directly. Once patient information has been entered, as demonstrated in Figure 10B, the calculator returns a risk prediction. For instance, when “MLR” is set to 3, “Xanthoma Present” is set to “Yes”, and “Age (years)” is entered as 65, the calculator reports a 75.2% predicted probability of early gastric carcinoma and categorizes the patient as “High Risk”.
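The input-to-output flow of such a calculator can be sketched in Python as follows. This is a minimal illustration, not the deployed application: the class and function names, the feature order, the 0.5 cut-off, and the “Low Risk” label are all hypothetical, and the trained model is replaced by a stand-in callable that simply returns the probability shown in Figure 10B.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CalculatorInput:
    """The five fields shown in Figure 10A (attribute names are illustrative)."""
    atrophy_range: float     # "Range Value"
    mlr_range: float         # "MLR Range Value"
    mlr: float               # "MLR"
    age_years: float         # "Age (years)"
    xanthoma_present: bool   # "Xanthoma Present"

    def to_features(self) -> List[float]:
        # Hypothetical feature order; the deployed model's order is not given.
        return [self.atrophy_range, float(self.xanthoma_present),
                self.mlr, self.mlr_range, self.age_years]


HIGH_RISK_CUTOFF = 0.5  # hypothetical; the text does not state the cut-off


def assess(patient: CalculatorInput,
           predict_probability: Callable[[List[float]], float],
           cutoff: float = HIGH_RISK_CUTOFF) -> str:
    """Run a probability model on the inputs and attach a risk label."""
    probability = predict_probability(patient.to_features())
    if not 0.0 <= probability <= 1.0:
        raise ValueError("model must return a probability between 0 and 1")
    label = "High Risk" if probability >= cutoff else "Low Risk"
    return f"{probability:.1%} probability of EGC: {label}"


# Stand-in for the trained model: returns the value shown in Figure 10B.
def stand_in_model(features: List[float]) -> float:
    return 0.752


example = CalculatorInput(atrophy_range=0, mlr_range=0, mlr=3,
                          age_years=65, xanthoma_present=True)
print(assess(example, stand_in_model))
```

Passing the model as a callable keeps the input handling and risk labelling separate from the classifier, so the same wrapper could be used with any model that returns a probability.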
4. Discussion
In this multicenter retrospective study involving H. pylori-eradicated patients with histologically confirmed IM from two independent institutions, we developed and externally validated a robust machine learning-based predictive model—EGCM—using the CatBoost algorithm. Based on comprehensive feature selection incorporating Boruta and SHAP analyses, five clinically accessible predictors were identified: atrophy range, xanthoma, MLR, MLR range, and age. The EGCM demonstrated consistently strong diagnostic performance, achieving an AUC of 0.743 in the internal validation cohort and 0.905 in the external validation cohort, along with excellent calibration and superior clinical net benefit compared to traditional inflammation- and nutrition-based indices. To facilitate clinical implementation, the model was deployed as a user-friendly online risk calculator, offering individualized risk estimation to optimize surveillance strategies for EGC in this high-risk population.
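For readers less familiar with the two performance measures named above, the following Python sketch shows how an AUC and a decision-curve net benefit are conventionally computed. The scores and labels are toy values invented for illustration and are unrelated to the study cohorts; the function names are likewise illustrative.

```python
from typing import Sequence


def auc_by_ranking(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the share of (positive, negative) pairs in which the positive
    case receives the higher score; tied pairs count as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def net_benefit(scores: Sequence[float], labels: Sequence[int],
                threshold: float) -> float:
    """Decision-curve net benefit at a given threshold probability:
    TP/N - FP/N * threshold / (1 - threshold)."""
    n = len(labels)
    true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return true_pos / n - false_pos / n * threshold / (1.0 - threshold)


# Toy predicted probabilities and outcomes (1 = EGC), for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.35]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]

toy_auc = auc_by_ranking(toy_scores, toy_labels)
toy_net_benefit = net_benefit(toy_scores, toy_labels, threshold=0.5)
print(f"AUC = {toy_auc:.3f}, net benefit at 0.5 = {toy_net_benefit:.3f}")
```

The AUC summarizes ranking ability independently of any threshold, whereas net benefit weighs true positives against false positives at a chosen threshold probability, which is why the two measures are reported side by side.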
Gastric mucosal atrophy emerged as a central risk factor for EGC, consistent with previous findings. Adachi et al. demonstrated that greater endoscopic atrophy significantly predicted GC risk in post-eradication patients [17]. Similarly, Kuraoka et al. reported that patients with severe atrophy and no prior eradication had a higher frequency of elevated-type GC, supporting the concept of a persistent carcinogenic field [18].
Among endoscopic features, MLR showed a particularly strong association with EGC risk. Matsumoto et al. found MLR in 25.3% of patients one year post-eradication, with higher odds in those with IM (OR = 2.794, 95% CI: 1.155–6.757) and acid inhibitor use (OR = 1.948, 95% CI: 1.070–3.547); MLR itself was associated with GC (OR = 2.432, 95% CI: 1.264–4.679) [19]. In a multicenter prospective study, the MLR rate reached 30.1%, and all patients with MLR had pre-existing IM at corresponding sites; IM remained significantly associated with MLR (OR = 8.144, 95% CI: 2.811–23.592) [12]. Our MLR incidence (20%) was slightly lower, likely due to shorter surveillance intervals or lower baseline mucosal severity. Crucially, unlike previous binary MLR classifications, our model quantifies the MLR extent, which may explain its stronger predictive power.
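The odds ratios cited above can be compared more easily on the log scale. The Python sketch below back-calculates the log-scale standard error and an approximate z statistic from each reported 95% CI, and checks that the point estimate sits at the geometric midpoint of its interval. It assumes that the intervals were computed symmetrically on the log scale, as is conventional for logistic regression; the cited studies may have used other methods, and the dictionary keys are shorthand labels rather than terms from those studies.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def log_scale_summary(odds_ratio: float, lower: float, upper: float) -> dict:
    """Recover log-scale quantities from an OR and its 95% CI, assuming the
    interval is symmetric around log(OR)."""
    log_se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    return {
        "log_se": log_se,
        "geometric_midpoint": math.sqrt(lower * upper),
        "z": math.log(odds_ratio) / log_se,
    }


# (OR, lower bound, upper bound) as cited in the paragraph above.
REPORTED = {
    "IM and MLR [19]": (2.794, 1.155, 6.757),
    "acid inhibitor use and MLR [19]": (1.948, 1.070, 3.547),
    "MLR and GC [19]": (2.432, 1.264, 4.679),
    "IM and MLR [12]": (8.144, 2.811, 23.592),
}

summaries = {name: log_scale_summary(*values) for name, values in REPORTED.items()}

for name, summary in summaries.items():
    print(f"{name}: midpoint {summary['geometric_midpoint']:.3f}, "
          f"log SE {summary['log_se']:.3f}, z {summary['z']:.2f}")
```

A wider interval on the log scale corresponds to a larger standard error, so this view separates the size of an association from the precision with which it was estimated.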
Long-term data reinforce MLR’s carcinogenic significance. Iwata et al. observed that MLR prevalence increased from 3.6% to 18.7% over 15 years (p = 0.03) [9], highlighting the need for prolonged surveillance. Moreover, MLR-associated EGCs often present as reddish depressed lesions. Tahara et al. reported that magnifying endoscopy with narrow-band imaging (ME-NBI) achieved 93.9% diagnostic accuracy in distinguishing neoplastic from benign lesions [20].
GX also emerged as a significant endoscopic biomarker. Shen et al. identified GX as an independent risk factor for both precancerous lesions (OR = 3.197, 95% CI: 2.791–3.662) and gastric cancer (OR = 1.794, 95% CI: 1.394–2.309) [21]. Gao et al. found higher GX prevalence in patients with precancerous lesions (14.9%) and GC (19.8%) compared to chronic gastritis (6.2%) [22]. Feng et al. further demonstrated that GX correlates with atrophic gastritis (OR = 1.83), IM (OR = 2.42), and H. pylori infection (OR = 1.32); multiple GXs were associated with a higher burden of precancerous changes, suggesting a dose-dependent relationship [23].
Age was another independent predictor of EGC. Our findings of increasing age-related risk align with those of Iwata et al. [9] and Wei et al. [24], who reported that severe atrophy (OR = 2.71) and IM (OR = 5.0, p < 0.001) predicted post-eradication EGC. Adachi et al. noted that 47.5% of elderly patients showed no regression of atrophy after eradication [17], underscoring persistent mucosal risk. Matsushima et al. found that individuals ≥80 years old accounted for 50% of GC-related deaths despite eradication, emphasizing the need for vigilant endoscopic follow-up in this population [25]. Conversely, Jung et al. reported reduced GC incidence in patients ≥ 70 after eradication (SIR = 0.56, 95% CI: 0.52–0.61), though their study focused on primary prevention [26]. The steeper risk gradient in our cohort may reflect inclusion of patients with baseline metaplasia undergoing post-eradication surveillance.
The EGCM outperformed traditional biomarkers such as NLR, which, although statistically associated with GC (OR = 1.38, 95% CI: 1.04–1.83) [17], offers limited predictive utility alone. By integrating demographic, histologic, and endoscopic parameters—particularly MLR range and GX—the EGCM reflects the multifactorial nature of gastric carcinogenesis. The inclusion of corpus-predominant atrophy and advanced endoscopic signs aligns with recent data demonstrating their superior prognostic value [27,28].
In-depth analysis of model outputs, as illustrated in Figure 9, revealed the presence of both false positive and false negative cases, underscoring the need for continuous refinement of the EGCM. False negatives are particularly concerning in the context of EGC surveillance, as they may lead to missed opportunities for early intervention. For example, in Figure 9B, despite moderate-risk feature values, the model underestimated the malignancy risk, highlighting the limitations of current feature representation in capturing subtle disease signals. Conversely, false positives, such as in Figure 9C, may result in unnecessary anxiety or invasive procedures for low-risk individuals. These misclassifications suggest that the model, while robust overall, may benefit from incorporating additional predictive dimensions—such as mucosal texture, immune status, or metabolic profiles—to enhance its discriminatory power. Future iterations of the model should prioritize reducing these critical errors through expanded datasets, inclusion of multimodal biomarkers, and ongoing prospective validation in diverse populations.
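The trade-off between the two error types can be illustrated with the three distinct cases in Figure 9 (Figure 9D repeats the values of Figure 9A). The Python sketch below converts each quoted f(x) to a probability and reclassifies the cases at two decision thresholds. It assumes that f(x) is a log-odds score and that the default cut-off is a probability of 0.5; neither is stated in the text, and the lower threshold of 0.10 is chosen purely for illustration.

```python
import math
from typing import Dict


def sigmoid(log_odds: float) -> float:
    """Convert a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# (panel, f(x) quoted in the text, actual outcome with 1 = EGC)
CASES = [("9A", -0.445, 0), ("9B", -1.824, 1), ("9C", 2.365, 0)]


def outcome(predicted: int, actual: int) -> str:
    """Name the confusion-matrix cell for one prediction."""
    if predicted == 1:
        return "true positive" if actual == 1 else "false positive"
    return "false negative" if actual == 1 else "true negative"


def classify_cases(threshold: float) -> Dict[str, str]:
    """Label every case as positive when its probability reaches the threshold."""
    return {
        name: outcome(int(sigmoid(score) >= threshold), actual)
        for name, score, actual in CASES
    }


for threshold in (0.5, 0.10):
    print(f"threshold {threshold:.2f}: {classify_cases(threshold)}")
```

Under these assumptions, a threshold low enough to capture the missed case in Figure 9B also flags the case in Figure 9A, which is consistent with the argument that additional predictors, rather than threshold adjustment alone, are needed to reduce both error types.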
5. Limitations
Several limitations warrant consideration. The retrospective design, restricted to two centers, may limit generalizability and introduce selection bias. Although expert endoscopists performed consensus readings, interobserver variability in assessing atrophy, xanthoma, and MLR could affect reproducibility. Although the model was externally validated at a second institution, EGCM’s performance in diverse ethnic and geographic populations remains to be confirmed. Finally, unmeasured factors, such as dietary patterns, genetic polymorphisms, and microbiome alterations, may further modulate EGC risk but were not captured in our dataset.
6. Future Directions
Prospective, multicenter validation of EGCM is essential to establish its generalizability and clinical utility across varying practice settings. Incorporation of advanced imaging modalities (e.g., narrow-band imaging, confocal laser endomicroscopy) and emerging molecular biomarkers (e.g., DNA methylation signatures [29], microRNA profiles [30]) could augment model precision [31]. Additionally, longitudinal studies assessing dynamic risk changes with repeat endoscopy and laboratory assessments will clarify the optimal surveillance cadence [32,33]. Finally, health economic analyses should evaluate the cost-effectiveness of EGCM-guided surveillance pathways compared to standard care.
Despite the promising performance of our EGCM, we acknowledge that direct comparison with previously published machine learning (ML) models for early gastric cancer (EGC) prediction remains limited. This is primarily due to differences in target populations, model endpoints, and availability of variables. Most existing ML-based gastric cancer models have focused on advanced-stage disease, postoperative outcomes, or general cancer risk stratification, often relying on features such as imaging, genomic profiles, or metabolic panels that were not available in our dataset.
Furthermore, recent studies have highlighted the potential value of integrating metabolic and immune system biomarkers—including cytokines, chemokines, microbiome signatures, and host gene expression—in gastric cancer risk prediction [34,35]. These multimodal biological indicators could help capture the tumor microenvironment and systemic host responses more accurately. Future studies should therefore explore the integration of these multi-omic and systemic inflammatory features into ML-based prediction tools to enhance accuracy and biological interpretability, especially for the post-H. pylori eradication population.
7. Conclusions
We developed and externally validated a robust, interpretable machine learning model—Early Gastric Cancer Model (EGCM)—to predict the risk of EGC in H. pylori-eradicated patients with IM. Using data from two independent medical centers, EGCM was built upon five accessible clinical and endoscopic features: atrophy range, xanthoma, MLR, MLR range, and age. The model demonstrated superior discrimination and calibration compared to conventional inflammatory and nutritional indices across both internal and external cohorts. Decision curve analysis confirmed its clinical utility, and a web-based calculator was deployed to facilitate individualized risk estimation. EGCM provides a practical tool for identifying high-risk individuals, potentially guiding more effective surveillance strategies and improving early detection outcomes. While our current model has shown promising performance, future studies should aim to directly or indirectly compare EGCM with other published machine learning-based prediction tools for EGC. In addition, incorporating metabolic, immunologic, and multi-omic data could further enhance the model’s predictive accuracy, biological interpretability, and clinical applicability, ultimately contributing to more effective risk stratification and personalized surveillance strategies.
References
- 1. Xia J.Y., Aadam A.A. Advances in screening and detection of gastric cancer. J. Surg. Oncol. 2022, 125, 1104–1109. doi:10.1002/jso.26844. PMID: 35481909. PMCID: PMC9322671.
- 2. Sundar R., Nakayama I., Markar S.R., Shitara K., van Laarhoven H.W.M., Janjigian Y.Y., Smyth E.C. Gastric cancer. Lancet 2025, 405, 2087–2102. doi:10.1016/S0140-6736(25)00052-2. PMID: 40319897.
- 3. Tan N., Wu H., Cao M., Yang F., Yan X., He S., Cao M., Zhang S., Teng Y., Li Q. Global, regional, and national burden of early-onset gastric cancer. Cancer Biol. Med. 2024, 21, 667–678. doi:10.20892/j.issn.2095-3941.2024.0159. PMID: 39109684. PMCID: PMC11359495.
- 4. Mousavi S.E., Ilaghi M., Elahi Vahed I., Nejadghaderi S.A. Epidemiology and socioeconomic correlates of gastric cancer in Asia: Results from the GLOBOCAN 2020 data and projections from 2020 to 2040. Sci. Rep. 2025, 15, 6529. doi:10.1038/s41598-025-90064-6. PMID: 39988724. PMCID: PMC11847935.
- 5. Fu X.Y., Mao X.L., Chen Y.H., You N.N., Song Y.Q., Zhang L.H., Cai Y., Ye X.N., Ye L.P., Li S.W. The Feasibility of Applying Artificial Intelligence to Gastrointestinal Endoscopy to Improve the Detection Rate of Early Gastric Cancer Screening. Front. Med. 2022, 9, 886853. doi:10.3389/fmed.2022.886853. PMID: 35652070. PMCID: PMC9150174.
- 6. Bao Z., Jia N., Zhang Z., Hou C., Yao B., Li Y. Prospects for the application of pathological response rate in neoadjuvant therapy for gastric cancer. Front. Oncol. 2025, 15, 1528529. doi:10.3389/fonc.2025.1528529. PMID: 40291912. PMCID: PMC12021903.
- 7. Chivu R.F., Bobirca F., Melesteu I., Patrascu T. The Role of Helicobacter Pylori Infection in the Development of Gastric Cancer—Review of the Literature. Chirurgia 2024, 119, 1–10. doi:10.21614/chirurgia.119.eC.2971. PMID: 38657111.
- 8. Wu Z., Tang Y., Tang M., Wu Z., Xu Y. The relationship between the eradication of Helicobacter pylori and the occurrence of stomach cancer: An updated meta-analysis and systemic review. BMC Gastroenterol. 2025, 25, 278. doi:10.1186/s12876-025-03886-z. PMID: 40259215. PMCID: PMC12010618.
