Limitations of Sequence-Based Protein Representations for Parkinson's Disease Classification: A Leakage-Free Benchmark
César Jesús Núñez-Prado, Grigori Sidorov, Liliana Chanona-Hernández

TL;DR
This study evaluates the effectiveness of protein sequence-based representations for Parkinson's disease classification, revealing limited discriminative power and emphasizing the need for more informative biological features.
Contribution
It provides a comprehensive, leakage-free benchmark of sequence-based protein representations for disease classification, highlighting their limitations and establishing a reproducible baseline.
Findings
The best model (ProtBERT + MLP) achieved an F1-score of 0.704 and a ROC-AUC of 0.748.
Classical k-mer representations showed high recall but low precision.
Performance differences across representations were not statistically significant.
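For context on the classical k-mer representations mentioned above, the following is a minimal sketch of how a protein sequence can be turned into a fixed-length k-mer frequency vector. It is not the authors' code: the function name `kmer_frequencies`, the choice to skip windows containing non-standard residues, and the normalization to relative frequencies are all assumptions made for illustration. With `k=1` the same function yields the amino acid composition.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_frequencies(sequence, k=2):
    """Relative frequency of each overlapping k-mer over the standard alphabet.

    Hypothetical helper for illustration; windows that contain a
    non-standard residue (e.g. 'X') are skipped (an assumption).
    """
    sequence = sequence.upper()
    # Slide a window of width k across the sequence.
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(w for w in windows if all(c in AMINO_ACIDS for c in w))
    total = sum(counts.values())
    # Fixed vocabulary so every sequence maps to a vector of the same length (20**k).
    vocab = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocab}


composition = kmer_frequencies("MKVLAAGIV", k=1)  # amino acid composition, 20 entries
dipeptides = kmer_frequencies("MKVLAAGIV", k=2)   # 2-mer frequencies, 400 entries
```

The vocabulary grows as 20^k, so the vectors become sparse quickly as k increases, which is one reason such features are often paired with a classifier that tolerates high-dimensional sparse input.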
Abstract
The identification of reliable molecular biomarkers for Parkinson's disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models, all assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. The best-performing configuration (ProtBERT + MLP) achieves an F1-score of 0.704 ± 0.028 and ROC-AUC of 0.748 ± 0.047, indicating only moderate discriminative…
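The nested stratified cross-validation named in the abstract can be sketched as follows. This is an illustrative, standard-library-only outline under stated assumptions, not the authors' pipeline: `stratified_folds`, `nested_cv`, and the `fit` / `score` callbacks are hypothetical names, the fold counts are arbitrary, and the toy example scores by accuracy rather than F1 or ROC-AUC. The point it shows is structural: hyperparameters are chosen using inner folds drawn only from the outer training split, so the outer test fold never influences model selection.

```python
import random
from collections import defaultdict


def stratified_folds(indices, labels, n_folds, seed):
    """Split `indices` into n_folds lists with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets its share.
        for pos, i in enumerate(members):
            folds[pos % n_folds].append(i)
    return folds


def nested_cv(labels, fit, score, param_grid, outer_k=5, inner_k=3, seed=0):
    """Return one held-out score per outer fold; tuning sees only outer-train data."""
    outer = stratified_folds(list(range(len(labels))), labels, outer_k, seed)
    outer_scores = []
    for o, test_idx in enumerate(outer):
        train_idx = [i for f, fold in enumerate(outer) if f != o for i in fold]
        # Inner folds are built from the outer training indices only.
        inner = stratified_folds(train_idx, labels, inner_k, seed + 1 + o)

        def inner_score(params):
            vals = []
            for v, val_idx in enumerate(inner):
                fit_idx = [i for f, fold in enumerate(inner) if f != v for i in fold]
                vals.append(score(fit(fit_idx, params), val_idx))
            return sum(vals) / len(vals)

        best = max(param_grid, key=inner_score)
        # Refit on the whole outer-training split, then score once on the outer test fold.
        outer_scores.append(score(fit(train_idx, best), test_idx))
    return outer_scores


# Toy usage: one feature per sample and a threshold "model" (both assumptions).
features = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.7, 0.8, 0.9, 0.65, 0.15, 0.85]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1]


def fit(train_idx, threshold):
    # A real model would learn from train_idx; here the hyperparameter is the model.
    return threshold


def score(model, idx):
    # Accuracy of predicting class 1 whenever the feature exceeds the threshold.
    return sum((features[i] > model) == bool(labels[i]) for i in idx) / len(idx)


scores = nested_cv(labels, fit, score, param_grid=[0.25, 0.5, 0.75], outer_k=3, inner_k=2)
mean_score = sum(scores) / len(scores)
```

Reporting the mean and spread of the outer-fold scores is what gives figures of the form "mean ± deviation"; any preprocessing that is fitted to data (scaling, feature selection) would need to live inside `fit` to keep the estimate leakage-free.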
