# From clinical phenotypes to genomic signatures: machine learning integration for precision tuberculosis treatment prediction

**Authors:** Liping Li, Huanqing Liu, Qian Lei, Tingting Li

PMC · DOI: 10.3389/fbinf.2026.1787360 · Frontiers in Bioinformatics · 2026-03-03

## TL;DR

This study builds an ensemble model that combines clinical and transcriptomic data to predict which tuberculosis patients are at high risk of treatment failure, with the aim of guiding personalized treatment.

## Contribution

The main contribution is an ensemble machine learning model that integrates clinical and transcriptomic data to predict TB treatment response.

## Key findings

- An ensemble model using clinical and genomic data achieved an AUC of 0.986, outperforming models using only clinical or genomic data.
- Key predictors included CRP, DNA repair genes, interferon response pathways, age, and BMI.
- The model was externally validated with an AUC of 0.972, showing strong generalization.

## Abstract

Tuberculosis (TB) remains a major global health threat, causing approximately 1.5 million deaths each year. Despite progress in treatment, 15%–20% of patients still experience treatment failure or relapse, highlighting the urgent need for predictive tools that identify high-risk patients early. Current methods based on clinical parameters are limited both in prediction accuracy and in their ability to reveal the underlying biological mechanisms.

This study developed and validated a multi-omics integration model for predicting treatment response. We retrospectively collected clinical data from 467 tuberculosis patients and integrated transcriptomic data from three independent public cohorts (GSE19491, GSE31312, GSE83456), involving 3,240 differentially expressed genes. Key features were selected through feature engineering and bioinformatics analysis. We systematically evaluated 12 machine learning algorithms and adopted an ensemble learning strategy to construct the final model. Model performance was evaluated through rigorous cross-validation and in prospective validation cohorts.
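The abstract does not say which cross-validation scheme was used. As a minimal sketch, assuming a stratified k-fold split (a common choice when the outcome classes are imbalanced), the fold assignment could look like the following. The function name `stratified_kfold`, the toy labels, and the choice of `k=5` are illustrative assumptions, not details from the paper.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping class proportions
    roughly equal across folds. Hypothetical helper, not the authors' code."""
    rng = random.Random(seed)

    # Group sample indices by outcome class.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Shuffle within each class, then deal indices round-robin into k folds.
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)

    # Each fold serves once as the held-out test set.
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Toy labels (assumed): 1 = poor treatment response, 0 = good response.
labels = [1] * 10 + [0] * 40
splits = list(stratified_kfold(labels, k=5))
for train, test in splits:
    n_pos = sum(labels[i] for i in test)
    print(f"train={len(train)} test={len(test)} positives in test={n_pos}")
```

Stratifying matters here because treatment failure is the minority outcome; an unstratified split could leave a fold with few or no failures, which makes any per-fold AUC unstable or undefined.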

Clinical data analysis identified age, body mass index (BMI), and C-reactive protein (CRP) levels as significant predictors of treatment response. Transcriptomic analysis revealed 1,247 differentially expressed genes between responders and non-responders, enriched in immune response and metabolic pathways. Among the tested algorithms, the ensemble model based on Extra Trees performed best, with an area under the curve (AUC) of 0.986, significantly outperforming models that used only clinical data (AUC = 0.850) or only genomic data (AUC = 0.820). Feature importance analysis confirmed CRP, specific gene features (such as DNA repair and interferon response pathways), age, and BMI as the most important predictors. External validation confirmed the model’s robustness (AUC = 0.972).
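The AUC values reported above can be read as the probability that a randomly chosen non-responder receives a higher risk score than a randomly chosen responder. The sketch below computes AUC directly from that pairwise definition. The function `roc_auc` and all scores are made-up toy values for illustration; they are not data or results from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 1 = poor treatment response.
labels = [1, 1, 1, 0, 0, 0]
clinical_only = [0.9, 0.4, 0.6, 0.5, 0.3, 0.2]  # one positive ranked below a negative
combined = [0.9, 0.7, 0.8, 0.4, 0.3, 0.2]       # every positive ranked above every negative

print(round(roc_auc(labels, clinical_only), 3))
print(roc_auc(labels, combined))
```

The quadratic pairwise loop is chosen for readability; for cohorts of realistic size, a rank-based computation gives the same value more efficiently.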

This study developed a high-precision prediction model that integrates clinical and genomic data and can identify, at an early stage, high-risk patients likely to respond poorly to treatment. The model demonstrates excellent predictive performance and generalization ability, providing a tool for advancing tuberculosis precision medicine and for guiding individualized treatment strategies that improve patient prognosis and help control the spread of drug resistance.

Clinical trial registration: https://www.chictr.org.cn/, ChiCTR2300074328, 03/08/2023.

## Linked entities

- **Diseases:** tuberculosis (MONDO:0018076)

## Full-text entities

- **Genes:** CRP (C-reactive protein) [NCBI Gene 1401] {aka PTX1}
- **Diseases:** deaths (MESH:D003643), TB (MESH:D014376)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12993280/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12993280/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC12993280/full.md

---
Source: https://tomesphere.com/paper/PMC12993280