# Research on Plant RNA-Binding Protein Prediction Method Based on Improved Ensemble Learning

**Authors:** Hongwei Zhang, Yan Shi, Yapeng Wang, Xu Yang, Kefeng Li, Sio-Kei Im, Yu Han

PMC · DOI: 10.3390/biology14060672 · Biology · 2025-06-10

## TL;DR

This paper introduces a new method using machine learning to accurately predict RNA-binding proteins in plants, which helps researchers better understand plant gene regulation.

## Contribution

The approach is an ensemble that feeds the predictions of four shallow classifiers (SVM, LR, LDA, and LightGBM) into an enhanced TextCNN, alongside K-Peptide Composition (KPC) sequence features, to predict plant RNA-binding proteins.

## Key findings

- The method achieved 97.20% accuracy under 5-fold cross-validation (97.06% under 10-fold) on a benchmark dataset of 4992 sequences.
- On an independent dataset of 1086 sequences, it reached 99.72% accuracy, outperforming RBPLight by 12.98 percentage points and the original TextCNN by 25.23 percentage points.
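
The accuracy figures above belong to the same family of confusion-matrix metrics as the SN, SP, F1-score, and MCC values reported in the abstract. The following is a minimal sketch of the conventional definitions, not the authors' evaluation code. The function name `binary_metrics` and the counts are assumptions: the summary does not give the confusion matrix, so the example uses hypothetical counts (543 positives and 543 negatives, with 2 false negatives and 1 false positive) chosen only because they reproduce the reported percentages after rounding.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total                      # accuracy (ACC)
    sn = tp / (tp + fn) if (tp + fn) else 0.0    # sensitivity / recall (SN)
    sp = tn / (tn + fp) if (tn + fp) else 0.0    # specificity (SP)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sn / (precision + sn)) if (precision + sn) else 0.0
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "F1": f1, "MCC": mcc}


# Hypothetical counts (an assumption, not taken from the paper).
metrics = binary_metrics(tp=541, tn=542, fp=1, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

Note that the abstract reports MCC as a percentage, which corresponds to the usual coefficient in the range -1 to 1 multiplied by 100.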

## Abstract

Plants rely on special proteins, called RNA-binding proteins, to control their genes and guide their growth and development. Identifying these proteins is challenging, which slows down plant research. Our research proposes an effective computational method to find these proteins by studying their patterns, like decoding a puzzle. We merged various learning techniques to study 4992 plant proteins, achieving 97.20% accuracy in tests and 99.72% on a separate set of 1086 proteins, surpassing other methods. Our method accurately identifies RNA-binding proteins that control plant genes, making it easier to study how plants grow and develop. This tool helps researchers explore plant biology and advances research into plant genetics. By improving our understanding of gene regulation, our work supports discoveries that benefit plant science.

(1) Background: RNA-binding proteins (RBPs) play a crucial role in regulating gene expression in plants, affecting growth, development, and stress responses. Accurate prediction of plant-specific RBPs is vital for understanding gene regulation and enhancing genetic improvement. (2) Methods: We propose an ensemble learning method that integrates shallow and deep learning. It feeds prediction results from SVM, LR, LDA, and LightGBM into an enhanced TextCNN, using K-Peptide Composition (KPC) encoding (k = 1, 2) to form a 420-dimensional feature vector, extended to 424 dimensions by including those four prediction outputs. Redundancy is minimized using a Pearson correlation threshold of 0.80. (3) Results: On the benchmark dataset of 4992 sequences, our method achieved an ACC of 97.20% and 97.06% under 5-fold and 10-fold cross-validation, respectively. On an independent dataset of 1086 sequences, our method attained an ACC of 99.72%, an F1-score of 99.72%, an MCC of 99.45%, an SN of 99.63%, and an SP of 99.82%, outperforming RBPLight by 12.98 percentage points in ACC and the original TextCNN by 25.23 percentage points. (4) Conclusions: These results highlight our method’s superior accuracy and efficiency over PSSM-based approaches, enabling large-scale plant RBP prediction.
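
The 420-dimensional figure follows from the 20 standard amino acids: k = 1 gives 20 single-residue features and k = 2 gives 20 × 20 = 400 dipeptide features. The sketch below illustrates one common way to compute such a composition vector and then append four base-classifier outputs to reach 424 dimensions. It is an illustration under assumptions, not the authors' implementation: the function name `kpc_encode`, the normalization by the number of valid windows, the skipping of non-standard residues, the example sequence, and the four probability values are all hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kpc_encode(sequence, ks=(1, 2)):
    """Return k-peptide composition frequencies for each k in `ks`.

    Assumption: each k-mer count is divided by the number of windows that
    contain only standard residues, so each k-block sums to 1.
    """
    seq = sequence.upper()
    features = []
    for k in ks:
        kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        valid = 0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows with non-standard residues
                counts[kmer] += 1
                valid += 1
        features.extend(counts[m] / valid if valid else 0.0 for m in kmers)
    return features


# Hypothetical protein fragment, used only to show the output shape.
kpc_vector = kpc_encode("MKVLAAGIVGLLLAQW")

# Hypothetical outputs of the four base classifiers (SVM, LR, LDA, LightGBM).
base_outputs = [0.91, 0.84, 0.88, 0.95]
stacked_vector = kpc_vector + base_outputs

print(len(kpc_vector), len(stacked_vector))
```

In this sketch the stacked vector is what would be passed on to the downstream network; how the paper arranges or scales these inputs for the enhanced TextCNN is not stated in this summary.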

## Full-text entities

- **Genes:** SUGP1 (SURP and G-patch domain containing 1) [NCBI Gene 57794] {aka F23858, RBP, SF4}

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12189372/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12189372/full.md

## References

58 references — full list in the complete paper: https://tomesphere.com/paper/PMC12189372/full.md

---
Source: https://tomesphere.com/paper/PMC12189372