Learning the rules of peptide self-assembly through data mining with large language models
Zhenze Yang, Sarah K. Yorke, Tuomas P. J. Knowles, Markus J. Buehler

TL;DR
This study combines literature mining and machine learning to systematically analyze peptide self-assembly, creating a database and models that improve understanding and prediction of assembly behavior for various applications.
Contribution
It introduces a curated peptide assembly database and fine-tunes GPT models for literature mining, advancing systematic understanding of peptide self-assembly rules.
Findings
ML models achieve >80% accuracy in phase classification
Fine-tuned GPT outperforms pre-trained models in literature extraction
Workflow accelerates discovery of self-assembling peptides
Abstract
Peptides are ubiquitous and important biologically derived molecules that have been found to self-assemble to form a wide array of structures. Extensive research has explored the impacts of both internal chemical composition and external environmental stimuli on the self-assembly behavior of these systems. However, there is yet to be a systematic study that gathers this rich literature data and collectively examines these experimental factors to provide a global picture of the fundamental rules that govern protein self-assembly behavior. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and literature mining facilitated by a large language model. As a result, we collect more than 1,000 experimental data entries with information about peptide sequence, experimental conditions and corresponding self-assembly phases. Utilizing…
Taxonomy
Topics: Machine Learning in Bioinformatics · Chemical Synthesis and Analysis · Advanced Proteomics Techniques and Applications
Methods: Linear Layer · Cosine Annealing · Adam · Attention Dropout · Multi-Head Attention · Residual Connection · Softmax · Byte Pair Encoding · Weight Decay
