# A study on a real-world data-based VTE risk prediction model for lymphoma patients

**Authors:** Changli He, Yin Wang, Han Zhang, Sitian Li, Fengjiao Kang, Fengqun Cai, Lizhu Han, Qinan Yin, Gang Li, Xuewu Song, Yuan Bian

PMC · DOI: 10.3389/fphar.2025.1691271 · 2025-10-14

## TL;DR

This study creates a machine learning model to predict venous thromboembolism risk in lymphoma patients using real-world data, aiming to improve early detection and treatment decisions.

## Contribution

A machine learning model for VTE risk prediction in lymphoma patients, built from real-world data and selected from 243 candidate models that combine different imputation, sampling, feature selection, and learning methods.

## Key findings

- The optimal model (Simp-SMOTE_rf_GBM) achieved an AUC of 0.954 in predicting VTE risk.
- Nine key predictors were identified, including anticoagulant use, D-dimer, and ECOG score.
- The model supports early VTE screening and risk stratification in clinical practice.

## Abstract

Patients diagnosed with malignant tumors exhibit a markedly elevated risk of venous thromboembolism (VTE), which has a negative impact on their prognosis. Currently, there is no reliable predictive model specifically for thrombosis risk in lymphoma patients. This study aims to develop and validate a machine learning model leveraging real-world data, offering a dependable risk assessment tool for the early identification of VTE in lymphoma patients.

We retrospectively analyzed 605 hospitalized patients with lymphoma between January 2019 and June 2024. Candidate predictors included demographic characteristics, comorbidities and medical history, tumor-related factors, treatment-related factors, and laboratory parameters. The primary endpoint was the occurrence of VTE within 6 months after hospitalization for confirmed lymphoma. Model development incorporated three imputation methods, three sampling strategies, three feature selection approaches, and nine machine learning algorithms. Predictive performance was compared across all models.

Combining the imputation, sampling, and feature selection strategies yielded 27 datasets (3 × 3 × 3); fitting each of the nine algorithms to every dataset produced 243 models. The optimal model, Simp-SMOTE_rf_GBM (constructed using random forest imputation, SMOTE oversampling, and a gradient boosting machine), achieved the highest predictive performance (AUC = 0.954). SHAP-based model interpretation identified nine key predictors ranked by importance: anticoagulant use, D-dimer, lactate dehydrogenase, central venous catheterization, carcinoembryonic antigen (CEA), Eastern Cooperative Oncology Group (ECOG) score, serum total protein (TP), total cholesterol (TC), and infectious disease.
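The model count follows from a full cross of the four design choices. The sketch below enumerates that grid to show where 27 and 243 come from. Only the labels `Simp`, `SMOTE`, `rf`, and `GBM` appear in the source (in the winning model's name); every other label is a hypothetical placeholder, and the segment order imputation-sampling_selection_algorithm is an assumption read off that name, not something the abstract states.

```python
from itertools import product

# Labels taken from the winning model's name: "Simp", "SMOTE", "rf", "GBM".
# All other labels are hypothetical placeholders; the abstract does not
# name the remaining methods.
IMPUTERS = ["Simp", "imputer_B", "imputer_C"]        # 3 imputation methods
SAMPLERS = ["SMOTE", "sampler_B", "sampler_C"]       # 3 sampling strategies
SELECTORS = ["rf", "selector_B", "selector_C"]       # 3 feature selection approaches
ALGORITHMS = ["GBM"] + [f"algorithm_{i}" for i in range(2, 10)]  # 9 algorithms


def dataset_names():
    """One preprocessed dataset per (imputer, sampler, selector) triple."""
    return [
        f"{imp}-{samp}_{sel}"
        for imp, samp, sel in product(IMPUTERS, SAMPLERS, SELECTORS)
    ]


def model_names():
    """One model per (dataset, algorithm) pair."""
    return [f"{ds}_{alg}" for ds, alg in product(dataset_names(), ALGORITHMS)]


datasets = dataset_names()
models = model_names()
print(len(datasets), len(models))
```

In a real comparison each name would map to a fitted pipeline and a held-out AUC; here the names only make the 3 × 3 × 3 × 9 bookkeeping explicit.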

This study established and validated a machine learning model for predicting VTE risk in lymphoma patients, with the optimal model demonstrating excellent discriminatory ability (AUC = 0.954). The model provides evidence to guide the timing and strategy of anticoagulation, supporting early VTE screening and risk stratification in clinical practice. Its implementation has important implications for improving patient outcomes and advancing public health.
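For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen VTE case receives a higher predicted risk than a randomly chosen non-case. The following is a minimal sketch of that pairwise definition in plain Python; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 1 = VTE, 0 = no VTE.
toy_labels = [1, 1, 0, 0, 0, 1]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
toy_auc = roc_auc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

Under this reading, an AUC of 0.954 means that about 95% of case/non-case pairs are ordered correctly by the model's predicted risk. The quadratic loop is chosen for clarity; rank-based implementations are preferable for large cohorts.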

## Linked entities

- **Diseases:** lymphoma (MONDO:0003659), venous thromboembolism (MONDO:0005399), infectious disease (MONDO:0005550)

## Full-text entities

- **Diseases:** infectious disease (MESH:D003141), lymphoma (MESH:D008223), thrombosis (MESH:D013927), malignant tumors (MESH:D009369), VTE (MESH:D054556)
- **Chemicals:** TC (-), cholesterol (MESH:D002784)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12558764/full.md

---
Source: https://tomesphere.com/paper/PMC12558764