# Prediction of Protein-DNA Interface Hot Spots Based on Empirical Mode Decomposition and Machine Learning

**Authors:** Zirui Fang, Zixuan Li, Ming Li, Zhenyu Yue, Ke Li

PMC · DOI: 10.3390/genes15060676 · Genes · 2024-05-23

## TL;DR

This paper introduces EC-PDH, a method that combines features derived from empirical mode decomposition (EMD) with a CatBoost classifier to predict hot spots at protein-DNA binding interfaces, reaching an AUC of 0.847 on the test set.

## Contribution

The novel EC-PDH method combines empirical mode decomposition with CatBoost to improve hot spot prediction in protein-DNA interfaces.

## Key findings

- EC-PDH achieved an AUC of 0.847, MCC of 0.543, and F1 score of 0.772 on the test set.
- In the authors' comparative evaluation, the method outperformed existing state-of-the-art approaches in identifying protein-DNA interface hot spots.
- Feature selection with mRMR-SFS reduced the feature set from 218 dimensions to an optimal 11-dimensional subset.
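
For readers less familiar with the metrics reported above, the following is a minimal sketch of how F1, MCC and AUC can be computed for a binary hot spot classifier. It is not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all illustrative assumptions.

```python
import math

def f1_and_mcc(tp, fp, tn, fn):
    """Return (F1, MCC) from the four cells of a binary confusion matrix."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = hot spot, 0 = non-hot spot.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

# Assumed decision threshold of 0.5 to obtain hard predictions.
preds = [1 if s >= 0.5 else 0 for s in scores]
tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

f1, mcc = f1_and_mcc(tp, fp, tn, fn)
auc = roc_auc(labels, scores)
```

MCC uses all four confusion-matrix cells, which is why it is often reported alongside F1 when positive and negative classes are imbalanced, as they are in hot spot data sets.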

## Abstract

Protein-DNA interactions play a crucial role in biological activities such as gene expression, modification, replication and transcription. Precise identification of hot spots at protein-DNA binding interfaces underpins both the understanding of their physiological significance and the development of computational biology. In this paper, a hot spot prediction method called EC-PDH is proposed. First, we extracted features of these hot spots’ solid solvent-accessible surface area (ASA) and secondary structure; then, using the empirical mode decomposition (EMD) algorithm, we extracted the mean, variance, energy and autocorrelation function values of the first three intrinsic mode functions (IMFs) of these conventional features as new features. In total, 218-dimensional features were obtained. For feature selection, we used the maximum relevance minimum redundancy sequential forward selection method (mRMR-SFS) to obtain an optimal 11-dimensional feature subset. To address data imbalance, we used the SMOTE-Tomek algorithm to balance positive and negative samples, and finally used categorical boosting (CatBoost) to construct our hot spot prediction model for protein-DNA binding interfaces. Our method performs well on the test set, with AUC, MCC and F1 score values of 0.847, 0.543 and 0.772, respectively. In a comparative evaluation, EC-PDH outperforms the existing state-of-the-art methods in identifying hot spots.
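
The EMD-based feature step described in the abstract can be illustrated with a short sketch. It assumes the IMFs have already been produced by some EMD implementation and only shows how four summary statistics per IMF could be turned into a feature vector. The function names, the lag of 1, the population variance, the sum-of-squares definition of energy and the biased autocorrelation estimator are all assumptions made for illustration; the summary does not specify the authors' exact definitions.

```python
def imf_statistics(imf, lag=1):
    """Summarise one IMF (a list of floats) by four statistics.

    Assumed definitions: population variance, energy as the sum of
    squares, and a biased autocorrelation estimate at the given lag.
    """
    n = len(imf)
    if n <= lag:
        raise ValueError("IMF must be longer than the lag")
    mean = sum(imf) / n
    variance = sum((x - mean) ** 2 for x in imf) / n
    energy = sum(x * x for x in imf)
    if variance == 0:
        autocorr = 0.0  # a constant signal has no defined autocorrelation
    else:
        cov = sum((imf[t] - mean) * (imf[t + lag] - mean)
                  for t in range(n - lag)) / n
        autocorr = cov / variance
    return {"mean": mean, "variance": variance,
            "energy": energy, "autocorr": autocorr}

def emd_feature_vector(imfs, n_imfs=3, lag=1):
    """Concatenate the four statistics of the first n_imfs IMFs."""
    features = []
    for imf in imfs[:n_imfs]:
        stats = imf_statistics(imf, lag)
        features.extend([stats["mean"], stats["variance"],
                         stats["energy"], stats["autocorr"]])
    return features

# Hypothetical IMFs standing in for the output of an EMD routine.
toy_imfs = [
    [1.0, -1.0, 1.0, -1.0],
    [0.5, 0.5, -0.5, -0.5],
    [0.1, 0.2, 0.3, 0.4],
    [0.0, 0.0, 0.0, 0.0],  # a fourth IMF, ignored because n_imfs=3
]
first = imf_statistics(toy_imfs[0])
vector = emd_feature_vector(toy_imfs)
```

With three IMFs and four statistics each, one conventional feature yields 12 derived values under these assumptions. How the paper handles signals that decompose into fewer than three IMFs is not stated in this summary, so the sketch simply uses however many are available.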

## Full-text entities

- **Genes:** FEN1 (flap structure-specific endonuclease 1) [NCBI Gene 2237] {aka FEN-1, MF1, RAD2}, EGR1 (early growth response 1) [NCBI Gene 1958] {aka AT225, G0S30, KROX-24, NGFI-A, TIS8, ZIF-268}
- **Chemicals:** zinc (MESH:D015032), water (MESH:D014867), Hydrogen (MESH:D006859), K+ (MESH:D011188), amino acid (MESH:D000596)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** 293 — Homo sapiens (Human), Transformed cell line (CVCL_0045), 3Q8L. — Mus musculus (Mouse), Transformed cell line (CVCL_WI42)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11202800/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11202800/full.md

## References

49 references — full list in the complete paper: https://tomesphere.com/paper/PMC11202800/full.md

---
Source: https://tomesphere.com/paper/PMC11202800