A Multimodal Human Protein Embeddings Database: DeepDrug Protein Embeddings Bank (DPEB)
Md Saiful Islam Sajol, Magesh Rajasekaran, Hayden Gemeinhardt, Adam Bess, Chris Alvin, Supratik Mukhopadhyay

TL;DR
DPEB is a comprehensive database of human protein embeddings integrating structural, sequence, and contextual data, enabling improved protein interaction predictions and various biological applications.
Contribution
DPEB provides the first integrated collection of multimodal human protein embeddings, including AlphaFold2's internal neural network features, for enhanced computational modeling.
Findings
GraphSAGE with BioEmbedding achieved 87.37% AUROC in PPI prediction
DPEB enabled accurate enzyme and protein family classification
DPEB supports multiple GNN methods for diverse biological analyses
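For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how AUROC and accuracy can be computed for a binary PPI predictor. It is not the authors' evaluation code. The function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration.

```python
def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive pair
    receives a higher score than a randomly chosen negative pair
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of pairs classified correctly at a fixed threshold
    (the 0.5 default is an assumption, not a value from the paper)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


# Toy data: 1 = interacting pair, 0 = non-interacting pair (made up).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

print(f"AUROC:    {auroc(labels, scores):.4f}")
print(f"Accuracy: {accuracy(labels, scores):.4f}")
```

AUROC is threshold-free, while accuracy depends on the chosen cutoff, which is one reason the two reported numbers can differ noticeably.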
Abstract
Computationally predicting protein-protein interactions (PPIs) is challenging due to the lack of integrated, multimodal protein representations. DPEB is a curated collection of 22,043 human proteins that integrates four embedding types: structural (AlphaFold2), transformer-based sequence (BioEmbeddings), contextual amino acid patterns (ESM-2: Evolutionary Scale Modeling), and sequence-based n-gram statistics (ProtVec). AlphaFold2 protein structures are available through public databases (e.g., AlphaFold2 Protein Structure Database), but the internal neural network embeddings are not. DPEB addresses this gap by providing AlphaFold2-derived embeddings for computational modeling. Our benchmark evaluations show that GraphSAGE with BioEmbedding achieved the highest PPI prediction performance (87.37% AUROC, 79.16% accuracy). The framework also achieved 77.42% accuracy for enzyme classification…
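To make the GraphSAGE-based PPI setup more concrete, here is a toy sketch of one mean-aggregation layer followed by a dot-product link score. It is an illustration under stated assumptions, not the DPEB pipeline. The protein IDs, the 2-dimensional features, the hand-picked weight matrix, and the sigmoid link scorer are all hypothetical. In practice the node features would be the much higher-dimensional embeddings from the database, and the weights would be learned.

```python
import math


def mean_vec(vectors, dim):
    """Element-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sage_layer(features, neighbors, weight):
    """One GraphSAGE-style layer with a mean aggregator:
    h_v' = ReLU(W · concat(h_v, mean of neighbor features))."""
    dim = len(next(iter(features.values())))
    out = {}
    for node, h_self in features.items():
        h_neigh = mean_vec([features[u] for u in neighbors.get(node, [])], dim)
        concat = h_self + h_neigh  # list concatenation, length 2 * dim
        out[node] = [max(0.0, sum(w * x for w, x in zip(row, concat)))
                     for row in weight]
    return out


def link_score(h_a, h_b):
    """Hypothetical interaction score: sigmoid of the dot product."""
    dot = sum(a * b for a, b in zip(h_a, h_b))
    return 1.0 / (1.0 + math.exp(-dot))


# Toy graph with made-up protein IDs and 2-d stand-ins for real embeddings.
features = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 1.0]}
neighbors = {"P1": ["P2", "P3"], "P2": ["P1"], "P3": ["P1"]}
# Fixed illustrative weights (output = self features + neighbor mean).
weight = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0]]

hidden = sage_layer(features, neighbors, weight)
print(hidden["P1"], hidden["P2"])
print(f"score(P1, P2) = {link_score(hidden['P1'], hidden['P2']):.3f}")
```

The design point the sketch shows is that each protein's representation mixes its own embedding with an aggregate of its interaction partners' embeddings, so the choice of input embedding (for example BioEmbedding versus ProtVec) directly shapes what the layer can learn.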
