Leveraging Retrieval-Augmented Generation and Large Language Models to Predict SERCA-Binding Protein Fragments from Cardiac Proteomics Data
Taylor A. Phillips, Alejandro W. Huskey, Patrick T. Huskey, Seth L. Robia, and Peter M. Kekenes-Huskey

TL;DR
This study demonstrates that large language models, combined with retrieval-augmented generation, can predict SERCA-binding protein fragments from limited cardiac proteomics data, aiding in protein function annotation.
Contribution
The paper introduces a novel approach using LLMs and RAG to predict protein interactions from small datasets, enhancing proteomics analysis capabilities.
Findings
LLMs can classify SERCA-binding fragments with limited data
Prompt tuning improves classification accuracy
Identified novel ER-localized proteins with potential SERCA interaction
Abstract
Large language models (LLMs) have shown promise in various natural language processing tasks, including their application to proteomics data to classify protein fragments. In this study, we curated a limited mass spectrometry dataset with thousands of protein fragments, consisting of proteins that appear to be attached to the endoplasmic reticulum (ER) in cardiac cells, of which a fraction was cloned and characterized for their impact on SERCA, an ER calcium pump. With this limited dataset, we sought to determine whether LLMs could correctly predict whether a new protein fragment could bind SERCA, based only on its sequence and a few biophysical characteristics, such as hydrophobicity, determined from that sequence. To do so, we generated random sequences based on cloned fragments, embedded the fragments into a retrieval-augmented generation (RAG) database to group them by similarity, then…
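The workflow the abstract outlines (derive a biophysical feature from the sequence, retrieve similar labeled fragments, and hand them to an LLM as context) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a mean Kyte-Doolittle hydropathy score as the hydrophobicity feature, uses cosine similarity over 2-mer counts as a stand-in for a learned embedding, and stops at building the prompt rather than calling any model. The fragment sequences, their labels, and all function names are hypothetical placeholders.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy scale (assumed here as the hydrophobicity measure).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(seq):
    """Average Kyte-Doolittle score over all residues of the fragment."""
    return sum(KD[aa] for aa in seq) / len(seq)

def kmer_vector(seq, k=2):
    """Sparse k-mer count vector; a crude stand-in for a learned embedding."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, labeled, top_k=2):
    """Return the top_k labeled fragments most similar to the query."""
    qv = kmer_vector(query)
    scored = [(cosine(qv, kmer_vector(seq)), seq, label) for seq, label in labeled]
    return sorted(scored, reverse=True)[:top_k]

def build_prompt(query, labeled, top_k=2):
    """Assemble a few-shot prompt from the retrieved neighbors."""
    lines = ["Labeled fragments most similar to the query:"]
    for _, seq, label in retrieve(query, labeled, top_k):
        lines.append(f"- {seq} (mean hydropathy {mean_hydropathy(seq):.2f}): {label}")
    lines.append(f"Query: {query} (mean hydropathy {mean_hydropathy(query):.2f})")
    lines.append("Does the query fragment bind SERCA? Answer 'binder' or 'non-binder'.")
    return "\n".join(lines)

# Toy sequences with made-up labels, for illustration only (not data from the paper).
LABELED = [
    ("LLIVLAVFLL", "binder"),
    ("DEKRDESKEE", "non-binder"),
]

if __name__ == "__main__":
    print(build_prompt("LLIVLAVF", LABELED))
```

In a real pipeline the 2-mer vectors would be replaced by embeddings stored in a vector database, and the prompt would be sent to an LLM; the retrieval step only determines which labeled examples the model sees.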
Taxonomy
Topics: Advanced Proteomics Techniques and Applications · Machine Learning in Bioinformatics · Bioinformatics and Genomic Networks
