Atom-by-atom protein generation and beyond with language models
Daniel Flam-Shepherd, Kevin Zhu, Alán Aspuru-Guzik

TL;DR
This paper demonstrates that chemical language models can generate proteins atom-by-atom, enabling the design of novel proteins and biomolecules beyond the constraints of the genetic code, including unnatural amino acids and protein-drug conjugates.
Contribution
It introduces a method for atom-level protein generation using language models, bridging chemical and biological representations for advanced biomolecular design.
Findings
Language models can generate proteins with atom-level detail.
Models can create proteins with unnatural amino acids.
Simultaneous exploration of chemical and protein space is possible.
Abstract
Protein language models learn powerful representations directly from sequences of amino acids. However, they are constrained to generate proteins with only the set of amino acids represented in their vocabulary. In contrast, chemical language models learn atom-level representations of smaller molecules that include every atom, bond, and ring. In this work, we show that chemical language models can learn atom-level representations of proteins, enabling protein generation that is not constrained to the standard genetic code and extends far beyond it. In doing so, we show that language models can generate entire proteins atom by atom -- effectively learning the multiple hierarchical layers of molecular information that define proteins, from their primary sequence to their secondary and tertiary structure. We demonstrate that language models are able to explore beyond protein space -- generating proteins with…
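The contrast the abstract draws between an amino-acid vocabulary and an atom-level vocabulary can be illustrated with a small sketch. It is not the authors' pipeline: this excerpt does not say which string representation the paper uses, so SMILES is assumed here, and the names `SIDE_CHAINS`, `peptide_to_smiles`, and `atom_tokens`, the residue subset, and the simplified tokenizer regex are all hypothetical.

```python
import re

# Assumption: side chains written as SMILES fragments for a small,
# illustrative subset of residues. None marks glycine (no side chain).
SIDE_CHAINS = {
    "G": None,
    "A": "C",
    "S": "CO",
    "V": "C(C)C",
    "L": "CC(C)C",
    "F": "Cc1ccccc1",
    "K": "CCCCN",
    "D": "CC(=O)O",
}

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# single-letter atoms, ring-closure digits, and bond/branch symbols.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|\d|[()=#@+\-/\\%.]")


def peptide_to_smiles(sequence, extra_side_chains=None):
    """Write a linear peptide as one SMILES string, N-terminus first.

    extra_side_chains lets the caller map a new one-letter symbol to any
    side-chain fragment, which is how a residue outside the standard
    twenty can be expressed at the atom level.
    """
    table = dict(SIDE_CHAINS)
    if extra_side_chains:
        table.update(extra_side_chains)
    parts = []
    for residue in sequence:
        side = table[residue]
        if side is None:
            parts.append("NCC(=O)")  # glycine backbone unit, no stereocentre
        else:
            parts.append(f"N[C@@H]({side})C(=O)")  # L-configured backbone unit
    return "".join(parts) + "O"  # close the C-terminal carboxylic acid


def atom_tokens(smiles):
    """Split a SMILES string into atom-level tokens."""
    return TOKEN_PATTERN.findall(smiles)


if __name__ == "__main__":
    # "X" is a hypothetical symbol for norleucine (butyl side chain),
    # a residue with no entry in the standard amino-acid vocabulary.
    smiles = peptide_to_smiles("GAXF", extra_side_chains={"X": "CCCC"})
    print(smiles)
    print(len("GAXF"), "residue tokens vs", len(atom_tokens(smiles)), "atom-level tokens")
```

At the residue level the peptide is four symbols, one of which lies outside the standard vocabulary. At the atom level the same molecule is a longer string over a small alphabet of atoms, bonds, and ring closures, so an unusual side chain needs no new vocabulary entry.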
Taxonomy
Topics: Machine Learning in Materials Science · Protein Structure and Dynamics · Machine Learning in Bioinformatics
