# BloodProST: prediction of blood-secretory proteins through self-training

**Authors:** Xuechen Mu, Long Xu, Zhenyu Huang, Jing Yan, Bocheng Shi, Yishi Wang, Binyue Liu, Kai Zhang, Ying Xu

PMC · DOI: 10.1093/bib/bbaf385 · Briefings in Bioinformatics · 2025-08-01

## TL;DR

BloodProST is a machine-learning framework that predicts blood-secretory proteins using self-training, improving accuracy without needing extensive manual annotations.

## Contribution

BloodProST introduces a self-training framework with unsupervised feature selection and a dual-pathway CNN-LSTM architecture for predicting blood-secretory proteins.
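The abstract describes the feature selection module as differential evolution that optimizes the Silhouette score. The sketch below illustrates that idea in plain Python and is not the authors' implementation. The mask encoding (a feature is kept when its value exceeds 0.5), the tiny 2-means clustering step, the DE/rand/1/bin variant, and every function and parameter name here are assumptions made for illustration.

```python
import math
import random

def two_means(points, iters=10):
    """Tiny 2-means; centroids start at the first and last point (assumption)."""
    cents = [points[0], points[-1]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min((0, 1), key=lambda k: math.dist(p, cents[k])) for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                cents[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; -1.0 if only one cluster is present."""
    if len(set(labels)) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.get(labels[i])
        if not own:  # singleton cluster contributes 0 by convention
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for k, d in dists.items() if k != labels[i])
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)

def fitness(vec, X):
    """Silhouette of a 2-means clustering on the features with vec[i] > 0.5."""
    idx = [i for i, v in enumerate(vec) if v > 0.5]
    if not idx:
        return -1.0
    pts = [[row[i] for i in idx] for row in X]
    return silhouette(pts, two_means(pts))

def de_select(X, pop_size=8, gens=20, F=0.8, CR=0.9, seed=0):
    """DE/rand/1/bin over continuous masks; returns (boolean mask, best score)."""
    rng = random.Random(seed)
    dim = len(X[0])
    # Seed the population with the all-features mask so the result is never worse.
    pop = [[1.0] * dim] + [[rng.random() for _ in range(dim)] for _ in range(pop_size - 1)]
    scores = [fitness(v, X) for v in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = [
                min(1.0, max(0.0, pop[a][d] + F * (pop[b][d] - pop[c][d])))
                if (rng.random() < CR or d == j_rand) else pop[i][d]
                for d in range(dim)
            ]
            s = fitness(trial, X)
            if s >= scores[i]:  # greedy selection keeps the better vector
                pop[i], scores[i] = trial, s
    best = max(range(pop_size), key=scores.__getitem__)
    return [v > 0.5 for v in pop[best]], scores[best]
```

Calling `de_select(X)` on a list of feature rows returns a boolean mask and its Silhouette score. Because selection is greedy and the initial population contains the all-features mask, the returned score is never lower than the score obtained with every feature kept.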

## Key findings

- BloodProST outperforms 14 state-of-the-art models in predicting blood-secretory proteins.
- The model's predictions are biologically relevant, as supported by secretion-related markers such as signal peptides and transmembrane regions.
- BloodProST generalizes well to other biofluids, such as urine.

## Abstract

Accurate identification of proteins secreted into the bloodstream is essential for discovering diagnostic biomarkers and therapeutic targets. A significant challenge is the scarcity of experimentally validated blood-secretory proteins, limiting labeled datasets required for robust model training. To address this issue, we propose BloodProST, a novel machine-learning framework leveraging a self-training strategy to reliably predict blood-secretory proteins. BloodProST iteratively expands the labeled dataset by generating high-confidence pseudo-labels from a large pool of unlabeled protein sequences, thereby progressively enhancing model predictions without continuous manual annotation. At its core, BloodProST incorporates an unsupervised feature selection module based on differential evolution, optimizing the Silhouette score to identify the most discriminative physicochemical and sequence-derived features. Additionally, BloodProST employs a dual-pathway convolutional neural network (CNN) and long short-term memory (LSTM) architecture: a CNN-based pathway captures local information from pre-constructed features, whereas an LSTM-based pathway extracts high-level sequential dependencies directly from protein sequences. Furthermore, domain-specific biological priors, such as the expected proportion of secretory proteins, are integrated into the model’s loss function to guide training toward biologically plausible predictions. Extensive evaluation demonstrates that BloodProST significantly outperforms 14 state-of-the-art models across multiple metrics, achieving superior predictive accuracy, robustness, and interpretability. Validation analyses confirm the biological relevance of predictions through secretion-related markers (e.g. signal peptides and transmembrane regions) and demonstrate effective generalization to other biofluids, such as urine.
Collectively, these results illustrate BloodProST’s potential as a versatile computational tool for secretion prediction and biomarker discovery across diverse biological fluids.
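
The self-training loop described in the abstract (fit on labeled data, pseudo-label the unlabeled pool, keep only high-confidence predictions, repeat) can be sketched as follows. This is a minimal illustration under stated assumptions, not the BloodProST pipeline: a nearest-centroid classifier stands in for the CNN-LSTM model, confidence is a softmax over negative distances, and the names `self_train`, `threshold`, and `max_rounds` are hypothetical.

```python
import math

def fit_centroids(X, y):
    """Return {label: centroid} computed from the labeled points."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict_proba(centroids, x):
    """Softmax over negative Euclidean distances to each centroid (stand-in model)."""
    scores = {label: math.exp(-math.dist(x, c)) for label, c in centroids.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    """Iteratively move confidently predicted unlabeled points into the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        centroids = fit_centroids(X_lab, y_lab)
        remaining, added = [], 0
        for x in pool:
            proba = predict_proba(centroids, x)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:   # accept as a pseudo-label
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:                           # keep for a later round
                remaining.append(x)
        pool = remaining
        if added == 0:                      # nothing confident left: stop early
            break
    return fit_centroids(X_lab, y_lab), X_lab, y_lab, pool
```

With two labeled points at `[0.0, 0.0]` and `[10.0, 10.0]`, unlabeled points close to either one are absorbed in the first round, while a point equidistant from both stays in the pool because its confidence never reaches the threshold.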

## Full-text entities

- **Genes:** SHROOM4 (shroom family member 4) [NCBI Gene 57477] {aka MRXSSDS, SHAP, shrm4}, ETV3 (ETS variant transcription factor 3) [NCBI Gene 2117] {aka METS, PE-1, PE1}
- **Diseases:** Cancer (MESH:D009369)
- **Chemicals:** Ile (MESH:D007532), Val (MESH:D014633), Tyr (MESH:D014443), Phe (MESH:D010649), Leu (MESH:D007930), cysteine (MESH:D003545), GPI (MESH:D017261), amino acids (MESH:D000596), DE (-), Trp (MESH:D014364), Ala (MESH:D000409), Disulfide (MESH:D004220)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12315548/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12315548/full.md

## References

67 references — full list in the complete paper: https://tomesphere.com/paper/PMC12315548/full.md

---
Source: https://tomesphere.com/paper/PMC12315548