Peptide Sequencing Via Protein Language Models
Thuong Le Hoai Pham, Jillur Rahman Saurav, Aisosa A. Omere, Calvin J. Heyl, Mohammad Sadegh Nasr, Cody Tyler Reynolds, Jai Prakash Yadav Veerla, Helen H Shang, Justyn Jaworski, Alison Ravenscraft, Joseph Anthony Buonomo, Jacob M. Luber

TL;DR
This paper presents a transformer-based model that predicts complete peptide sequences from limited amino acid data, improving accuracy and biological relevance in proteomics analysis.
Contribution
It introduces a novel approach combining simulated experimental data with a protein language model to enhance peptide sequencing accuracy.
Findings
Achieves up to 90.5% accuracy with four known amino acids.
Validates predictions using AlphaFold structural assessments.
Demonstrates potential for evolutionary and proteomic analysis.
Abstract
We introduce a protein language model for determining the complete sequence of a peptide based on measurement of a limited set of amino acids. To date, protein sequencing relies on mass spectrometry, with some novel Edman degradation-based platforms able to sequence non-native peptides. Current protein sequencing techniques face limitations in accurately identifying all amino acids, hindering comprehensive proteome analysis. Our method simulates partial sequencing data by selectively masking amino acids that are experimentally difficult to identify in protein sequences from the UniRef database. This targeted masking mimics real-world sequencing limitations. We then modify and fine-tune a ProtBert-derived transformer model for a new downstream task: predicting these masked residues, which provides an approximation of the complete sequence. Evaluating on three bacterial Escherichia…
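The targeted masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `mask_unknown_residues`, the example sequence, and the choice of K, C, Y, and M as the "known" amino acids are all hypothetical and picked only for illustration. The space-separated residue format with a `[MASK]` token follows the usual input convention for ProtBert-style tokenizers, which is assumed here rather than confirmed by the summary.

```python
# Hypothetical sketch: simulate partial sequencing by keeping only residues
# that are assumed to be experimentally identifiable and masking the rest.

MASK_TOKEN = "[MASK]"  # assumed mask token, as used by BERT-style tokenizers


def mask_unknown_residues(sequence, known_residues, mask_token=MASK_TOKEN):
    """Return a space-separated token string where every residue that is
    not in `known_residues` is replaced by `mask_token`."""
    known = set(known_residues)
    return " ".join(aa if aa in known else mask_token for aa in sequence)


if __name__ == "__main__":
    # Illustrative peptide and an arbitrary set of four "known" amino acids.
    peptide = "MKTAYIAKQR"
    masked = mask_unknown_residues(peptide, known_residues="KCYM")
    print(masked)
    # M K [MASK] [MASK] Y [MASK] [MASK] K [MASK] [MASK]
```

A fine-tuned masked language model would then be asked to fill each `[MASK]` position, and the filled-in string serves as the approximate full sequence.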
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Machine Learning in Bioinformatics · Wikis in Education and Collaboration
Methods: Sparse Evolutionary Training · AlphaFold
