Beyond ESM2: Graph-Enhanced Protein Sequence Modeling with Efficient Clustering
Shujian Jiao, Bingxuan Li, Lei Wang, Xiaojin Zhang, Wei Chen, Jiajie Peng, Zhongyu Wei

TL;DR
This paper introduces a novel protein modeling approach that enhances the ESM2 framework by integrating protein family classification and clustering techniques, resulting in improved global and local protein representations for biological analysis.
Contribution
It combines community clustering with a contextual prediction task to significantly improve protein representations beyond existing models like ESM2.
Findings
Achieved state-of-the-art results in downstream protein tasks
Enhanced global protein representations through clustering
Improved amino acid prediction accuracy
Abstract
Proteins are essential to life's processes, underpinning evolution and diversity. Advances in sequencing technology have revealed millions of proteins, underscoring the need for sophisticated pre-trained protein models for biological analysis and AI development. Facebook's ESM2, the most advanced protein language model to date, leverages a masked prediction task for unsupervised learning, crafting amino acid representations with notable biochemical accuracy. Yet it falls short in delivering functional protein insights, signaling an opportunity to enhance representation quality. Our study addresses this gap by incorporating protein family classification into ESM2's training. This approach, augmented with a Community Propagation-Based Clustering Algorithm, improves global protein representations, while a contextual prediction task fine-tunes local amino acid accuracy. Significantly, our model…
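The masked prediction task mentioned in the abstract can be illustrated with a minimal sketch: hide a random subset of residues and keep the originals as prediction targets. This is not the authors' or ESM2's actual implementation. The function name `mask_sequence`, the `<mask>` placeholder, the 15% mask rate (a common default in masked language modeling), and the example sequence are all assumptions made for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder symbol chosen for this sketch


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Hide a random subset of residues; return (masked tokens, targets).

    mask_rate=0.15 is an assumed default, not a value taken from the paper.
    """
    tokens = list(sequence)
    if not tokens:
        return [], {}
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    # Mask at least one position so every sequence yields a training signal.
    n_masked = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    # Targets map each masked position to the residue the model must recover.
    targets = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK_TOKEN
    return tokens, targets


if __name__ == "__main__":
    # Hypothetical 22-residue sequence, used only as example input.
    masked, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
    print(" ".join(masked))
    print(targets)
```

A model trained on this objective receives the masked tokens and is scored only on the positions listed in `targets`. The family classification and clustering components described in the abstract would act on a sequence-level representation instead, and they are not shown here.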
Taxonomy
Topics: Bioinformatics and Genomic Networks · Machine Learning in Bioinformatics · Microbial Metabolic Engineering and Bioproduction
