# Protein structural domain-disease association prediction based on heterogeneous networks

**Authors:** Jingpu Zhang, Lianping Deng, Lei Deng

PMC · DOI: 10.1186/s12864-024-11117-0 · BMC Genomics · 2025-04-10

## TL;DR

This paper predicts associations between protein structural domains and diseases by extracting meta-path features from a heterogeneous domain-protein-disease network and training an XGBOOST classifier on them.

## Contribution

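A method that combines a heterogeneous information network of domains, proteins, and diseases with an XGBOOST binary classifier to predict domain-disease associations, reaching an AUC of 0.94 in leave-one-out cross-validation.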

## Key findings

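- The XGBOOST model outperformed models built with the other machine learning algorithms tested, reaching an AUC of 0.94 in leave-one-out cross-validation.
- The classifier is trained on topological features extracted along meta-paths between domain and disease nodes.
- Performance on independent test sets suggests the method can help identify disease-related domains.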
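For readers less familiar with the AUC metric cited above, the following is a minimal sketch of how it can be computed from classifier scores. It uses the pairwise definition (the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair). The labels and scores are made-up toy values, not data from the paper, and the function name `roc_auc` is only illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 1 = known domain-disease association.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 1.0 means every positive pair is ranked above every negative pair, and 0.5 corresponds to random ranking.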

## Abstract

Domains can be viewed as portable units of protein structure, folding, function, evolution, and design. Small proteins often consist of a single domain, while most large proteins combine multiple domains to carry out composite cellular functions. Dysfunction of a domain may impair the function of its protein and contribute to disease. Inferring disease-related domains will therefore improve our understanding of the mechanisms of complex human diseases.

In this study, we first build a global heterogeneous information network from structure-based domains, proteins, and diseases. We then extract topological features of the network along the meta-paths between domain and disease nodes. Finally, we train a binary classifier based on the XGBOOST (eXtreme Gradient Boosting) algorithm to predict potential associations between domains and diseases. The results show that the binary classification model using the XGBOOST algorithm performs significantly better than models using other machine learning algorithms, achieving an AUC (Area Under Curve) score of 0.94 in the leave-one-out cross-validation experiment.
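The meta-path feature step described above can be illustrated with a small sketch. It counts path instances between a domain and a disease along two meta-paths and returns the counts as a feature vector. Everything specific here is an assumption made for illustration: the toy edges, the node names, the choice of the two meta-paths, and the use of raw path counts as features. The paper's actual meta-paths and feature definitions are in the full text.

```python
from collections import defaultdict

# Hypothetical toy network: adjacency maps for each relation type.
domain_protein = {"d1": {"p1", "p2"}, "d2": {"p3"}}
protein_protein = {"p1": {"p3"}, "p2": set(), "p3": {"p1"}}
protein_disease = {"p1": {"dis1"}, "p2": {"dis1"}, "p3": {"dis2"}}


def count_paths(start, end, relations):
    """Count path instances from start to end along a sequence of relations."""
    frontier = {start: 1}  # node -> number of paths reaching it so far
    for rel in relations:
        nxt = defaultdict(int)
        for node, n_paths in frontier.items():
            for neighbor in rel.get(node, ()):
                nxt[neighbor] += n_paths
        frontier = nxt
    return frontier.get(end, 0)


# Assumed meta-paths; the names are illustrative only.
META_PATHS = {
    "domain-protein-disease": [domain_protein, protein_disease],
    "domain-protein-protein-disease": [
        domain_protein,
        protein_protein,
        protein_disease,
    ],
}


def meta_path_features(domain, disease):
    """One path count per meta-path, in the order of META_PATHS."""
    return [count_paths(domain, disease, rels) for rels in META_PATHS.values()]


features_d1_dis1 = meta_path_features("d1", "dis1")
features_d1_dis2 = meta_path_features("d1", "dis2")
print(features_d1_dis1, features_d1_dis2)
```

In a pipeline like the one the abstract outlines, vectors of this kind for known and unknown domain-disease pairs would serve as the input to the binary classifier.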

We develop a method that builds a binary classifier from meta-path-based topological features to predict potential associations between domains and diseases. Its predictive performance on independent test sets shows that the method is effective. Moreover, representing domains and diseases by integrating more multi-omic data could further improve predictive performance.

The online version contains supplementary material available at 10.1186/s12864-024-11117-0.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11987217/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11987217/full.md

## References

7 references — full list in the complete paper: https://tomesphere.com/paper/PMC11987217/full.md

---
Source: https://tomesphere.com/paper/PMC11987217