Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files
Daniel Flam-Shepherd, Alán Aspuru-Guzik

TL;DR
This paper demonstrates that language models trained on 3D chemical file formats can directly generate molecules, materials, and protein binding sites in three dimensions without architectural modifications, with performance comparable to state-of-the-art models.
Contribution
The paper shows that language models trained on raw chemical file formats can generate 3D structures directly, extending their scope beyond graph-based representations.
Findings
Language models trained on XYZ, CIF, and PDB files can generate valid 3D structures.
Performance is comparable to specialized graph-based and domain-specific models.
No architectural modifications are needed for 3D structure generation.
Abstract
Language models are powerful tools for molecular design. Currently, the dominant paradigm is to parse molecular graphs into linear string representations that can easily be trained on. This approach has been very successful; however, it is limited to chemical structures that can be completely represented by a graph, like organic molecules, while materials and biomolecular structures like protein binding sites require a more complete representation that includes the relative positioning of their atoms in space. In this work, we show how language models, without any architecture modifications, trained using next-token prediction, can generate novel and valid structures in three dimensions from various substantially different distributions of chemical structures. In particular, we demonstrate that language models trained on sequences derived directly from chemical file…
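To make "sequences derived directly from chemical files" concrete, here is a minimal sketch of how the text of an XYZ file might be flattened into tokens for next-token prediction, and rebuilt afterwards. The function names `xyz_to_tokens` and `tokens_to_xyz`, the two-decimal rounding, and the one-token-per-coordinate granularity are all assumptions made for illustration; the summary above does not specify the tokenization the authors use.

```python
# Hypothetical sketch: flatten XYZ file text into a token sequence and back.
# Rounding precision and token granularity are illustrative assumptions,
# not details taken from the paper.

def xyz_to_tokens(xyz_text: str, decimals: int = 2) -> list[str]:
    """Convert XYZ text into tokens: element symbol, x, y, z, newline per atom."""
    lines = xyz_text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])       # first line of XYZ: atom count
    atom_lines = lines[2:2 + n_atoms]        # second line is a free-form comment
    tokens: list[str] = []
    for line in atom_lines:
        symbol, x, y, z = line.split()[:4]
        tokens.append(symbol)
        for value in (x, y, z):
            # Round each coordinate so the vocabulary stays finite.
            tokens.append(f"{float(value):.{decimals}f}")
        tokens.append("\n")                  # marks the end of one atom record
    return tokens


def tokens_to_xyz(tokens: list[str], comment: str = "generated") -> str:
    """Rebuild XYZ text from a token sequence produced by xyz_to_tokens."""
    rows: list[str] = []
    current: list[str] = []
    for tok in tokens:
        if tok == "\n":
            if current:
                rows.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:                              # tolerate a missing final newline token
        rows.append(" ".join(current))
    return "\n".join([str(len(rows)), comment, *rows]) + "\n"


# A water molecule in XYZ format (coordinates in angstroms).
example_xyz = """3
water
O 0.000000 0.000000 0.117300
H 0.000000 0.757200 -0.469200
H 0.000000 -0.757200 -0.469200
"""

example_tokens = xyz_to_tokens(example_xyz)
print(example_tokens[:5])
print(tokens_to_xyz(example_tokens))
```

A sequence like this can be fed to any standard autoregressive model, and sampled tokens can be written back out as a file, which is the sense in which no architectural change is needed. CIF and PDB files would need their own parsing, since they carry additional fields such as lattice parameters or residue labels.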
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Protein Structure and Dynamics
